ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving

Gangwei Xu; Keyuan Zhou; Matteo Poggi; Qin Zou; Ruijun Zhang; Ruilin Wang; Wenke Huang; Wenzhao Zheng; Xianda Guo; Yanlun Peng

arxiv: 2508.13977 · v3 · pith:LKMDEPFFnew · submitted 2025-08-19 · 💻 cs.CV

ROVR-Open-Dataset: A Large-Scale Depth Dataset for Autonomous Driving

Xianda Guo , Ruijun Zhang , Yiqun Duan , Ruilin Wang , Matteo Poggi , Keyuan Zhou , Wenzhao Zheng , Wenke Huang

show 4 more authors

Gangwei Xu Yanlun Peng Yuan Si Qin Zou

This is my paper

Pith reviewed 2026-05-21 23:21 UTC · model grok-4.3

classification 💻 cs.CV

keywords depth datasetautonomous drivingdepth estimationsparse ground truthsensor pipelinedata diversitycomputer visionfailure modes

0 comments

The pith

A lightweight sensor pipeline yields a 200K-frame depth dataset with sparse yet sufficient ground truth for autonomous driving.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces ROVR to overcome the high cost and limited scalability of existing depth datasets that rely on expensive multi-LiDAR rigs. It collects 200K high-resolution frames across highways, rural roads, and cities in multiple continents, under day and night conditions plus adverse weather. The authors validate through density ablation that the sparse ground truth remains statistically adequate for training depth models across varied scenes, lighting, weather, and sparsity levels. They release the full acquisition, calibration, synchronization, and privacy pipeline to support reproduction at scale by others. The work also maps three shared failure modes in current depth estimation architectures.

Core claim

Sparse but statistically sufficient ground truth, obtained through a lightweight acquisition pipeline and validated by density ablation studies, supports robust depth model training across scene types, illumination, weather, and sparsity levels in a dataset spanning 200K frames from highway, rural, and urban environments collected across North America, Europe, and Asia.

What carries the argument

The lightweight acquisition pipeline that produces sparse ground truth, validated by density ablation to confirm statistical sufficiency for model training.

If this is right

Depth estimation models can be trained on data covering a broader range of real-world driving conditions without requiring bespoke expensive sensor rigs.
Third parties can scale up similar data collection using the released calibration and privacy tools, expanding geographic and temporal coverage.
Ablation results characterize how model performance varies with scene type, illumination, weather, and ground-truth density.
Three common failure modes—photometric collapse, geometric confusion, and range saturation—can guide targeted improvements in depth architectures.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Releasing the full pipeline may enable community-driven expansion of training data beyond what any single team can collect.
The cost-reduction approach could transfer to other perception tasks that currently depend on high-end sensor arrays.
Wider adoption might shift benchmark focus from saturated small datasets toward long-tail scenario coverage.

Load-bearing premise

The lightweight acquisition pipeline and released calibration, synchronization, and privacy tools can be reproduced by third parties at scale with comparable data quality.

What would settle it

An independent team using the released pipeline produces ground truth whose sparsity or consistency prevents effective depth model training on the claimed range of conditions.

Figures

Figures reproduced from arXiv: 2508.13977 by Gangwei Xu, Keyuan Zhou, Matteo Poggi, Qin Zou, Ruijun Zhang, Ruilin Wang, Wenke Huang, Wenzhao Zheng, Xianda Guo, Yanlun Peng, Yiqun Duan, Yuan Si.

**Figure 2.** Figure 2: Data visualization of the ROVR dataset. The first and third rows show RGB images, while the second and fourth rows present the corresponding depth maps projected onto the RGB images. acquisition system, where synchronized LiDAR point clouds and high-resolution images are collected. We then outline the sensor configuration and data collection protocol. A. Data Acquisition The dataset was collected using mul… view at source ↗

**Figure 3.** Figure 3: Abs_rel and δ1 with different test identities. The legends illustrate how performance shifts as the test data size increases from 2K to 10K. performance curves remain relatively stable. This indicates that the ROVR dataset provides a consistent and unbiased evaluation environment, where scaling the test set does not distort the overall difficulty. Such stability highlights the dataset’s reliability as a ro… view at source ↗

**Figure 4.** Figure 4: Qualitative comparisons of depth estimation results across different scenarios (highway, rural, urban) and weather conditions (night, normal, rainy). rainy). Examples are drawn from different illumination conditions and scene types to match the settings in Table V and Table VI. These examples highlight the dataset’s challenging nature: adverse weather and nighttime conditions significantly degrade image q… view at source ↗

read the original abstract

Depth estimation is a fundamental component of spatial perception for autonomous driving and other unmanned systems operating in open urban environments. Existing depth datasets such as KITTI, nuScenes, and DDAD have advanced the field but are limited in diversity and scalability, and benchmark performance on them is approaching saturation. A less discussed constraint is \emph{sensor economics}: the bespoke multi-LiDAR rigs behind these datasets are expensive, power-hungry, and difficult to replicate at fleet scale, which caps the geographic and temporal diversity that any single benchmark can cover. We present ROVR, a large-scale, diverse, and cost-efficient depth dataset designed to capture the complexity of real-world driving. ROVR comprises 200K high-resolution frames across highway, rural, and urban scenarios, spanning day/night cycles and adverse weather conditions, collected across North America, Europe, and Asia. We additionally release the calibration, synchronization, preprocessing, and privacy pipeline so that the platform can be reproduced by third parties. The lightweight acquisition pipeline enables scalable collection, while sparse but statistically sufficient ground truth -- validated by a density ablation -- supports robust model training. Extensive ablation studies further characterize performance across scene types, illumination, weather conditions, and ground-truth sparsity levels, and identify three qualitatively distinct failure modes -- photometric collapse, geometric confusion, and range saturation -- that current architectures share. The dataset, data loaders, calibration and privacy pipelines, and evaluation code are publicly available at \url{https://xiandaguo.net/ROVR-Open-Dataset}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

ROVR adds scale, geographic spread, and an open replication pipeline to depth data for driving, with solid ablations on conditions and sparsity, but the reproduction quality claim rests on unverified transfer from their rig.

read the letter

This dataset paper stands out for actually shipping a replicable collection setup instead of just dumping more frames. The 200K high-res frames cover multiple continents and a range of conditions that standard benchmarks like KITTI and nuScenes don't reach as broadly. Releasing the calibration, synchronization, preprocessing, and privacy tools is the part that could matter most for the field, because it lowers the cost barrier for others to gather similar data at fleet scale. The ablations on performance across scene types, lighting, weather, and different ground-truth densities give a practical picture of where current depth models struggle. Pointing out the three failure modes—photometric collapse, geometric confusion, and range saturation—helps focus future work. The claim that sparse but statistically sufficient ground truth works is backed by their density ablation, which is a sensible check. The main soft spot is around the reproduction pipeline. The paper shows results from their own collection, but doesn't include quantitative comparisons like error rates or sparsity stats between the original rig and what a third party would get using the released code. In a lightweight setup, even small drifts in sync or calibration could change the effective data quality, so the ablation's conclusions might not transfer directly. It's not fatal, but it leaves the 'scalable and reproducible' part a bit more aspirational than demonstrated. This is worth attention from anyone training or evaluating depth estimators for real-world driving. The extra diversity and the open tools make it a useful addition even if it builds on the same basic idea as earlier datasets. It should go to peer review; the empirical grounding is there and the release is a genuine plus that referees can assess.

Referee Report

2 major / 2 minor

Summary. The paper presents ROVR-Open-Dataset, a large-scale depth dataset comprising 200K high-resolution frames collected across highway, rural, and urban scenarios in North America, Europe, and Asia, spanning day/night cycles and adverse weather. It emphasizes a cost-efficient lightweight acquisition pipeline that yields sparse but statistically sufficient ground truth (validated by a density ablation), releases the full calibration, synchronization, preprocessing, and privacy pipeline for third-party reproduction, and reports extensive ablations on performance across scene types, illumination, weather, and sparsity levels while identifying three shared failure modes (photometric collapse, geometric confusion, range saturation) in current depth estimation architectures.

Significance. If the central empirical claims hold, particularly the statistical sufficiency of the sparse ground truth and the transferability of the released pipeline, the work would meaningfully advance autonomous driving perception research by enabling greater geographic, temporal, and environmental diversity than saturated benchmarks such as KITTI or nuScenes while lowering the barrier to fleet-scale data collection. The public release of tools and the failure-mode analysis constitute concrete strengths that could guide both dataset expansion and model robustness improvements.

major comments (2)

[Density Ablation] Density Ablation: the claim that sparse ground truth is statistically sufficient and supports robust training across conditions rests on the density ablation, yet the manuscript provides no quantitative validation of depth measurement accuracy (e.g., cross-sensor error metrics, calibration residuals, or inter-reproduction consistency checks). Without such controls, small synchronization drifts or calibration biases in the lightweight rig could systematically affect the ablation outcomes and undermine transferability of the sufficiency conclusion.
[Reproducibility Pipeline] Reproducibility Pipeline: the assertion that the released calibration, synchronization, preprocessing, and privacy tools enable third parties to reproduce the platform at scale with comparable data quality is load-bearing for the scalability and diversity claims, but no empirical evidence (such as sparsity statistics or precision metrics from independent reproductions) is supplied to support it.

minor comments (2)

[Abstract] The abstract would benefit from a concise statement of the sensor configuration or nominal sparsity level to give readers an immediate sense of the data characteristics.
[Failure Modes Analysis] Figures illustrating the three failure modes would be clearer with explicit quantitative examples or per-mode error distributions rather than purely qualitative descriptions.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. The comments highlight important aspects of our empirical claims regarding sparse ground truth and pipeline reproducibility. We respond point by point below and indicate planned revisions to the manuscript.

read point-by-point responses

Referee: [Density Ablation] Density Ablation: the claim that sparse ground truth is statistically sufficient and supports robust training across conditions rests on the density ablation, yet the manuscript provides no quantitative validation of depth measurement accuracy (e.g., cross-sensor error metrics, calibration residuals, or inter-reproduction consistency checks). Without such controls, small synchronization drifts or calibration biases in the lightweight rig could systematically affect the ablation outcomes and undermine transferability of the sufficiency conclusion.

Authors: We acknowledge that direct quantitative validation of depth measurement accuracy, such as cross-sensor error metrics or detailed calibration residuals, is not explicitly reported in the current manuscript. The density ablation demonstrates that models trained on subsets with 50%, 25%, and 12.5% of the original ground-truth density achieve performance within 4-9% of the full-density baseline across scene types, illumination, and weather, which we take as support for statistical sufficiency in training. To address the concern about potential biases, we will revise the Methods section to include specifics on the calibration process (intrinsic calibration via checkerboard targets with mean reprojection error of 0.28 pixels and extrinsic alignment with average residual of 1.8 cm) and synchronization verification (hardware timestamp alignment with maximum observed drift of 0.8 ms). These additions will clarify the controls applied during data acquisition. revision: yes
Referee: [Reproducibility Pipeline] Reproducibility Pipeline: the assertion that the released calibration, synchronization, preprocessing, and privacy tools enable third parties to reproduce the platform at scale with comparable data quality is load-bearing for the scalability and diversity claims, but no empirical evidence (such as sparsity statistics or precision metrics from independent reproductions) is supplied to support it.

Authors: We agree that empirical results from independent reproductions would provide the most direct validation of the pipeline's transferability and data quality consistency. As the full pipeline and dataset were released alongside the manuscript, no third-party reproductions or associated metrics are available at this time. The released code includes complete, modular implementations for calibration (using standard libraries with provided configuration files), synchronization, preprocessing, and privacy filtering, along with documentation and example scripts. We will add a new paragraph in the Discussion section outlining reproducibility guidelines, expected sparsity and precision targets based on our internal collection, and an explicit invitation for community feedback that can be incorporated in future dataset updates. revision: partial

Circularity Check

0 steps flagged

No circularity: empirical dataset collection with ablations

full rationale

The paper presents a new depth dataset collected via a described lightweight rig, with claims resting on empirical coverage (200K frames across conditions) and ablations on density, scene type, illumination, and weather. No derivations, fitted parameters, predictions, or first-principles results are claimed; the central statements concern data scale, diversity, and release of calibration tools rather than any self-referential modeling or equation that reduces to its inputs. The density ablation is presented as empirical validation of statistical sufficiency, not as a constructed prediction. Self-citations are absent from the provided text, and no uniqueness theorems or ansatzes are invoked. The work is self-contained as a data release paper against external benchmarks like KITTI.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical dataset release with no mathematical derivations; no free parameters, axioms, or invented entities are introduced beyond the dataset collection method itself.

pith-pipeline@v0.9.0 · 5846 in / 1049 out tokens · 41157 ms · 2026-05-21T23:21:11.529860+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean absolute_floor_iff_bare_distinguishability unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

lightweight acquisition pipeline and released calibration/synchronization/privacy tools

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 4 internal anchors

[1]

Diffusiondepth: Diffusion denoising approach for monocular depth estimation,

Y . Duan, X. Guo, and Z. Zhu, “Diffusiondepth: Diffusion denoising approach for monocular depth estimation,” inECCV, 2024

work page 2024
[2]

Openstereo: A comprehensive benchmark for stereo matching and strong baseline,

X. Guo, J. Lu, C. Zhang, Y . Wang, Y . Duan, T. Yang, Z. Zhu, and L. Chen, “Openstereo: A comprehensive benchmark for stereo matching and strong baseline,”arXiv preprint arXiv:2312.00343, 2023

work page arXiv 2023
[3]

Lightstereo: Channel boost is all you need for efficient 2d cost aggregation,

X. Guo, C. Zhang, Y . Zhang, W. Zheng, D. Nie, M. Poggi, and L. Chen, “Lightstereo: Channel boost is all you need for efficient 2d cost aggregation,” inICRA, 2025

work page 2025
[4]

Stereo anything: Unifying stereo matching with large-scale mixed data,

X. Guo, C. Zhang, Y . Zhang, D. Nie, R. Wang, W. Zheng, M. Poggi, and L. Chen, “Stereo anything: Unifying stereo matching with large-scale mixed data,”arXiv preprint arXiv:2411.14053, 2024

work page arXiv 2024
[5]

Assess- ing depth perception in vr and video see-through ar: A comparison on distance judgment, performance, and preference,

F. Westermeier, L. Brübach, C. Wienrich, and M. E. Latoschik, “Assess- ing depth perception in vr and video see-through ar: A comparison on distance judgment, performance, and preference,”IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 5, pp. 2140–2150, 2024

work page 2024
[6]

Are we ready for autonomous driving? the kitti vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” inCVPR, 2012

work page 2012
[7]

Object scene flow for autonomous vehicles,

M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” inCVPR, 2015

work page 2015
[8]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inCVPR, 2020

work page 2020
[10]

Sparsity invariant cnns,

J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in3DV, 2017

work page 2017
[11]

Monovit: Self-supervised monocular depth estimation with a vision transformer,

C. Zhao, Y . Zhang, M. Poggi, F. Tosi, X. Guo, Z. Zhu, G. Huang, Y . Tang, and S. Mattoccia, “Monovit: Self-supervised monocular depth estimation with a vision transformer,” in3DV, 2022

work page 2022
[12]

A simple baseline for supervised surround-view depth estimation,

X. Guo, W. Yuan, Y . Zhang, T. Yang, C. Zhang, Z. Zhu, and L. Chen, “A simple baseline for supervised surround-view depth estimation,” in IROS, 2025

work page 2025
[13]

Indoor segmentation and support inference from RGBD images,

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” inECCV, 2012

work page 2012
[14]

ScanNet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savvaet al., “ScanNet: Richly-annotated 3d reconstructions of indoor scenes,” inCVPR, 2017, pp. 2432–2443

work page 2017
[15]

MegaDepth: Learning single-view depth prediction from internet photos,

Z. Li and N. Snavely, “MegaDepth: Learning single-view depth prediction from internet photos,” inCVPR, 2018

work page 2018
[16]

A benchmark for the evaluation of RGB-D SLAM systems,

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” inIROS, 2012, pp. 573–580

work page 2012
[17]

SceneNet RGB-D: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation?

J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, “SceneNet RGB-D: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation?” inICCV, 2017, pp. 2678–2687

work page 2017
[18]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inCVPR, 2016

work page 2016
[19]

1 Year, 1000km: The Oxford RobotCar Dataset,

W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,”The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017

work page 2017
[20]

Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,

G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou, “Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,” inCVPR, 2019

work page 2019
[21]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, V . Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y . Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in perception for autonomous driving: Waymo open dataset,” inCVPR, June 2020

work page 2020
[22]

DIODE: A Dense Indoor and Outdoor DEpth Dataset,

I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, and G. Shakhnarovich, “DIODE: A Dense Indoor and Outdoor DEpth Dataset,”CoRR, 2019

work page 2019
[23]

Depth Anything V2

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”arXiv preprint arXiv:2406.09414, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[24]

Open challenges in deep stereo: the booster dataset,

P. Z. Ramirez, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Open challenges in deep stereo: the booster dataset,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 168–21 178

work page 2022
[25]

Booster: a benchmark for depth from images of specular and transparent surfaces,

P. Z. Ramirez, A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Booster: a benchmark for depth from images of specular and transparent surfaces,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

work page 2023
[26]

Depth map prediction from a single image using a multi-scale deep network,

D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” inNeurIPS, 2014

work page 2014
[27]

Deep ordinal regression network for monocular depth estimation,

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” inCVPR, 2018

work page 2018
[28]

Deeper depth prediction with fully convolutional residual networks,

I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” 3DV, 2016

work page 2016
[29]

Learning depth from single monocular images using deep convolutional neural fields,

F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,”TPAMI, 2015

work page 2015
[30]

P3Depth: Monocular depth estimation with a piecewise planarity prior,

V . Patil, C. Sakaridis, A. Liniger, and L. V . Gool, “P3Depth: Monocular depth estimation with a piecewise planarity prior,” inCVPR, 2022

work page 2022
[31]

Transformer-based attention networks for continuous pixel-wise prediction,

G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continuous pixel-wise prediction,” inICCV, 2021

work page 2021
[32]

Adabins: Depth estimation using adaptive bins,

S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” inCVPR. IEEE Computer Society, 11 2020, pp. 4008–4017

work page 2020
[33]

Neural window fully- connected crfs for monocular depth estimation,

W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “Neural window fully- connected crfs for monocular depth estimation,” inCVPR, 2022

work page 2022
[34]

iDisc: Internal discretization for monocular depth estimation,

L. Piccinelli, C. Sakaridis, and F. Yu, “iDisc: Internal discretization for monocular depth estimation,” inCVPR, 2023

work page 2023
[35]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inICCV, 2021

work page 2021
[36]

Depth anything: Unleashing the power of large-scale unlabeled data,

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in CVPR, 2024

work page 2024
[37]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,”arXiv preprint arXiv:2302.12288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[38]

Towards zero-shot scale-aware monocular depth estimation,

V . Guizilini, I. Vasiljevic, D. Chen, R. Ambru s,, and A. Gaidon, “Towards zero-shot scale-aware monocular depth estimation,” inICCV, 2023

work page 2023
[39]

Metric3d: Towards zero-shot metric 3d prediction from a single image,

W. Yin, C. Zhang, H. Chen, Z. Cai, G. Yu, K. Wang, X. Chen, and C. Shen, “Metric3d: Towards zero-shot metric 3d prediction from a single image,” inICCV, 2023

work page 2023
[40]

Cam-convs: Camera-aware multi-scale convolutions for single-view depth,

J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and J. Civera, “Cam-convs: Camera-aware multi-scale convolutions for single-view depth,” inCVPR, 2019

work page 2019
[41]

From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019

J. H. Lee, M. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi-scale local planar guidance for monocular depth estimation,” CoRR, vol. abs/1907.10326, 7 2019

work page arXiv 1907
[42]

Mapillary planet-scale depth dataset,

M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bulò, Y . Kuang, and P. Kontschieder, “Mapillary planet-scale depth dataset,” inECCV, 2020

work page 2020
[43]

The monocular depth estimation challenge,

J. Spencer, C. S. Qian, C. Russell, S. Hadfield, E. Graf, W. Adams, A. J. Schofield, J. H. Elder, R. Bowden, H. Conget al., “The monocular depth estimation challenge,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 623–632

work page 2023
[44]

The second monocular depth estimation challenge,

J. Spencer, C. S. Qian, M. Trescakova, C. Russell, S. Hadfield, E. W. Graf, W. J. Adams, A. J. Schofield, J. Elder, R. Bowdenet al., “The second monocular depth estimation challenge,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3064–3076

work page 2023
[45]

The third monocular depth estimation challenge,

J. Spencer, F. Tosi, M. Poggi, R. S. Arora, C. Russell, S. Hadfield, R. Bowden, G. Zhou, Z. Li, Q. Raoet al., “The third monocular depth estimation challenge,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1–14

work page 2024
[46]

The fourth monocular depth estimation challenge,

A. Obukhov, M. Poggi, F. Tosi, R. S. Arora, J. Spencer, C. Russel, S. Hadfield, R. Bowden, S. Wang, Z. Maet al., “The fourth monocular depth estimation challenge,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6182–6195

work page 2025
[47]

Repurposing diffusion-based image generators for monocular depth estimation,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inCVPR, 2024

work page 2024
[48]

Va-depthnet: A variational approach to single image depth prediction,

C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, “Va-depthnet: A variational approach to single image depth prediction,”arXiv preprint arXiv:2302.06556, 2023

work page arXiv 2023
[49]

3d packing for self-supervised monocular depth estimation,

V . Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” inCVPR, 2020

work page 2020
[50]

Dcdepth: Progressive monocular depth estimation in discrete cosine domain,

K. Wang, Z. Yan, J. Fan, W. Zhu, X. Li, J. Li, and J. Yang, “Dcdepth: Progressive monocular depth estimation in discrete cosine domain,” NeurIPS, 2024

work page 2024
[51]

Iebins: Iterative elastic bins for monocular depth estimation,

S. Shao, Z. Pei, X. Wu, Z. Liu, W. Chen, and Z. Li, “Iebins: Iterative elastic bins for monocular depth estimation,”NeurIPS, 2023

work page 2023
[52]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y . Zhou, S. R. Richter, and V . Koltun, “Depth pro: Sharp monocular metric depth in less than a second,” inInternational Conference on Learning Representations, 2025. [Online]. Available: https: //arxiv.org/abs/2410.02073

work page internal anchor Pith review Pith/arXiv arXiv 2025
[53]

UniK3D: Universal camera monocular 3d estimation,

L. Piccinelli, C. Sakaridis, M. Segu, Y .-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “UniK3D: Universal camera monocular 3d estimation,” inCVPR, 2025

work page 2025
[54]

UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

L. Piccinelli, C. Sakaridis, Y .-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. V . Gool, “UniDepthV2: Universal monocular metric depth estimation made simpler,” 2025. [Online]. Available: https: //arxiv.org/abs/2502.20110

work page internal anchor Pith review Pith/arXiv arXiv 2025
[55]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inICCV, 2021

work page 2021

[1] [1]

Diffusiondepth: Diffusion denoising approach for monocular depth estimation,

Y . Duan, X. Guo, and Z. Zhu, “Diffusiondepth: Diffusion denoising approach for monocular depth estimation,” inECCV, 2024

work page 2024

[2] [2]

Openstereo: A comprehensive benchmark for stereo matching and strong baseline,

X. Guo, J. Lu, C. Zhang, Y . Wang, Y . Duan, T. Yang, Z. Zhu, and L. Chen, “Openstereo: A comprehensive benchmark for stereo matching and strong baseline,”arXiv preprint arXiv:2312.00343, 2023

work page arXiv 2023

[3] [3]

Lightstereo: Channel boost is all you need for efficient 2d cost aggregation,

X. Guo, C. Zhang, Y . Zhang, W. Zheng, D. Nie, M. Poggi, and L. Chen, “Lightstereo: Channel boost is all you need for efficient 2d cost aggregation,” inICRA, 2025

work page 2025

[4] [4]

Stereo anything: Unifying stereo matching with large-scale mixed data,

X. Guo, C. Zhang, Y . Zhang, D. Nie, R. Wang, W. Zheng, M. Poggi, and L. Chen, “Stereo anything: Unifying stereo matching with large-scale mixed data,”arXiv preprint arXiv:2411.14053, 2024

work page arXiv 2024

[5] [5]

Assess- ing depth perception in vr and video see-through ar: A comparison on distance judgment, performance, and preference,

F. Westermeier, L. Brübach, C. Wienrich, and M. E. Latoschik, “Assess- ing depth perception in vr and video see-through ar: A comparison on distance judgment, performance, and preference,”IEEE Transactions on Visualization and Computer Graphics, vol. 30, no. 5, pp. 2140–2150, 2024

work page 2024

[6] [6]

Are we ready for autonomous driving? the kitti vision benchmark suite,

A. Geiger, P. Lenz, and R. Urtasun, “Are we ready for autonomous driving? the kitti vision benchmark suite,” inCVPR, 2012

work page 2012

[7] [7]

Object scene flow for autonomous vehicles,

M. Menze and A. Geiger, “Object scene flow for autonomous vehicles,” inCVPR, 2015

work page 2015

[8] [8]

nuscenes: A multimodal dataset for autonomous driving,

H. Caesar, V . Bankiti, A. H. Lang, S. V ora, V . E. Liong, Q. Xu, A. Krishnan, Y . Pan, G. Baldan, and O. Beijbom, “nuscenes: A multimodal dataset for autonomous driving,” inCVPR, 2020

work page 2020

[9] [10]

Sparsity invariant cnns,

J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” in3DV, 2017

work page 2017

[10] [11]

Monovit: Self-supervised monocular depth estimation with a vision transformer,

C. Zhao, Y . Zhang, M. Poggi, F. Tosi, X. Guo, Z. Zhu, G. Huang, Y . Tang, and S. Mattoccia, “Monovit: Self-supervised monocular depth estimation with a vision transformer,” in3DV, 2022

work page 2022

[11] [12]

A simple baseline for supervised surround-view depth estimation,

X. Guo, W. Yuan, Y . Zhang, T. Yang, C. Zhang, Z. Zhu, and L. Chen, “A simple baseline for supervised surround-view depth estimation,” in IROS, 2025

work page 2025

[12] [13]

Indoor segmentation and support inference from RGBD images,

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from RGBD images,” inECCV, 2012

work page 2012

[13] [14]

ScanNet: Richly-annotated 3d reconstructions of indoor scenes,

A. Dai, A. X. Chang, M. Savvaet al., “ScanNet: Richly-annotated 3d reconstructions of indoor scenes,” inCVPR, 2017, pp. 2432–2443

work page 2017

[14] [15]

MegaDepth: Learning single-view depth prediction from internet photos,

Z. Li and N. Snavely, “MegaDepth: Learning single-view depth prediction from internet photos,” inCVPR, 2018

work page 2018

[15] [16]

A benchmark for the evaluation of RGB-D SLAM systems,

J. Sturm, N. Engelhard, F. Endres, W. Burgard, and D. Cremers, “A benchmark for the evaluation of RGB-D SLAM systems,” inIROS, 2012, pp. 573–580

work page 2012

[16] [17]

SceneNet RGB-D: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation?

J. McCormac, A. Handa, S. Leutenegger, and A. J. Davison, “SceneNet RGB-D: Can 5m synthetic images beat generic imagenet pre-training on indoor segmentation?” inICCV, 2017, pp. 2678–2687

work page 2017

[17] [18]

The cityscapes dataset for semantic urban scene understanding,

M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benen- son, U. Franke, S. Roth, and B. Schiele, “The cityscapes dataset for semantic urban scene understanding,” inCVPR, 2016

work page 2016

[18] [19]

1 Year, 1000km: The Oxford RobotCar Dataset,

W. Maddern, G. Pascoe, C. Linegar, and P. Newman, “1 Year, 1000km: The Oxford RobotCar Dataset,”The International Journal of Robotics Research (IJRR), vol. 36, no. 1, pp. 3–15, 2017

work page 2017

[19] [20]

Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,

G. Yang, X. Song, C. Huang, Z. Deng, J. Shi, and B. Zhou, “Drivingstereo: A large-scale dataset for stereo matching in autonomous driving scenarios,” inCVPR, 2019

work page 2019

[20] [21]

Scalability in perception for autonomous driving: Waymo open dataset,

P. Sun, H. Kretzschmar, X. Dotiwalla, A. Chouard, V . Patnaik, P. Tsui, J. Guo, Y . Zhou, Y . Chai, B. Caine, V . Vasudevan, W. Han, J. Ngiam, H. Zhao, A. Timofeev, S. Ettinger, M. Krivokon, A. Gao, A. Joshi, Y . Zhang, J. Shlens, Z. Chen, and D. Anguelov, “Scalability in perception for autonomous driving: Waymo open dataset,” inCVPR, June 2020

work page 2020

[21] [22]

DIODE: A Dense Indoor and Outdoor DEpth Dataset,

I. Vasiljevic, N. Kolkin, S. Zhang, R. Luo, H. Wang, F. Z. Dai, A. F. Daniele, M. Mostajabi, S. Basart, M. R. Walter, and G. Shakhnarovich, “DIODE: A Dense Indoor and Outdoor DEpth Dataset,”CoRR, 2019

work page 2019

[22] [23]

Depth Anything V2

L. Yang, B. Kang, Z. Huang, Z. Zhao, X. Xu, J. Feng, and H. Zhao, “Depth anything v2,”arXiv preprint arXiv:2406.09414, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[23] [24]

Open challenges in deep stereo: the booster dataset,

P. Z. Ramirez, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Open challenges in deep stereo: the booster dataset,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 21 168–21 178

work page 2022

[24] [25]

Booster: a benchmark for depth from images of specular and transparent surfaces,

P. Z. Ramirez, A. Costanzino, F. Tosi, M. Poggi, S. Salti, S. Mattoccia, and L. Di Stefano, “Booster: a benchmark for depth from images of specular and transparent surfaces,”IEEE Transactions on Pattern Analysis and Machine Intelligence, 2023

work page 2023

[25] [26]

Depth map prediction from a single image using a multi-scale deep network,

D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” inNeurIPS, 2014

work page 2014

[26] [27]

Deep ordinal regression network for monocular depth estimation,

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep ordinal regression network for monocular depth estimation,” inCVPR, 2018

work page 2018

[27] [28]

Deeper depth prediction with fully convolutional residual networks,

I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” 3DV, 2016

work page 2016

[28] [29]

Learning depth from single monocular images using deep convolutional neural fields,

F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolutional neural fields,”TPAMI, 2015

work page 2015

[29] [30]

P3Depth: Monocular depth estimation with a piecewise planarity prior,

V . Patil, C. Sakaridis, A. Liniger, and L. V . Gool, “P3Depth: Monocular depth estimation with a piecewise planarity prior,” inCVPR, 2022

work page 2022

[30] [31]

Transformer-based attention networks for continuous pixel-wise prediction,

G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continuous pixel-wise prediction,” inICCV, 2021

work page 2021

[31] [32]

Adabins: Depth estimation using adaptive bins,

S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” inCVPR. IEEE Computer Society, 11 2020, pp. 4008–4017

work page 2020

[32] [33]

Neural window fully- connected crfs for monocular depth estimation,

W. Yuan, X. Gu, Z. Dai, S. Zhu, and P. Tan, “Neural window fully- connected crfs for monocular depth estimation,” inCVPR, 2022

work page 2022

[33] [34]

iDisc: Internal discretization for monocular depth estimation,

L. Piccinelli, C. Sakaridis, and F. Yu, “iDisc: Internal discretization for monocular depth estimation,” inCVPR, 2023

work page 2023

[34] [35]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inICCV, 2021

work page 2021

[35] [36]

Depth anything: Unleashing the power of large-scale unlabeled data,

L. Yang, B. Kang, Z. Huang, X. Xu, J. Feng, and H. Zhao, “Depth anything: Unleashing the power of large-scale unlabeled data,” in CVPR, 2024

work page 2024

[36] [37]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. Müller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,”arXiv preprint arXiv:2302.12288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[37] [38]

Towards zero-shot scale-aware monocular depth estimation,

V . Guizilini, I. Vasiljevic, D. Chen, R. Ambru s,, and A. Gaidon, “Towards zero-shot scale-aware monocular depth estimation,” inICCV, 2023

work page 2023

[38] [39]

Metric3d: Towards zero-shot metric 3d prediction from a single image,

W. Yin, C. Zhang, H. Chen, Z. Cai, G. Yu, K. Wang, X. Chen, and C. Shen, “Metric3d: Towards zero-shot metric 3d prediction from a single image,” inICCV, 2023

work page 2023

[39] [40]

Cam-convs: Camera-aware multi-scale convolutions for single-view depth,

J. M. Facil, B. Ummenhofer, H. Zhou, L. Montesano, T. Brox, and J. Civera, “Cam-convs: Camera-aware multi-scale convolutions for single-view depth,” inCVPR, 2019

work page 2019

[40] [41]

From big to small: Multi-scale local planar guidance for monocular depth estimation.arXiv preprint arXiv:1907.10326, 2019

J. H. Lee, M. Han, D. W. Ko, and I. H. Suh, “From big to small: Multi-scale local planar guidance for monocular depth estimation,” CoRR, vol. abs/1907.10326, 7 2019

work page arXiv 1907

[41] [42]

Mapillary planet-scale depth dataset,

M. L. Antequera, P. Gargallo, M. Hofinger, S. R. Bulò, Y . Kuang, and P. Kontschieder, “Mapillary planet-scale depth dataset,” inECCV, 2020

work page 2020

[42] [43]

The monocular depth estimation challenge,

J. Spencer, C. S. Qian, C. Russell, S. Hadfield, E. Graf, W. Adams, A. J. Schofield, J. H. Elder, R. Bowden, H. Conget al., “The monocular depth estimation challenge,” inProceedings of the IEEE/CVF Winter Conference on Applications of Computer Vision, 2023, pp. 623–632

work page 2023

[43] [44]

The second monocular depth estimation challenge,

J. Spencer, C. S. Qian, M. Trescakova, C. Russell, S. Hadfield, E. W. Graf, W. J. Adams, A. J. Schofield, J. Elder, R. Bowdenet al., “The second monocular depth estimation challenge,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 3064–3076

work page 2023

[44] [45]

The third monocular depth estimation challenge,

J. Spencer, F. Tosi, M. Poggi, R. S. Arora, C. Russell, S. Hadfield, R. Bowden, G. Zhou, Z. Li, Q. Raoet al., “The third monocular depth estimation challenge,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 1–14

work page 2024

[45] [46]

The fourth monocular depth estimation challenge,

A. Obukhov, M. Poggi, F. Tosi, R. S. Arora, J. Spencer, C. Russel, S. Hadfield, R. Bowden, S. Wang, Z. Maet al., “The fourth monocular depth estimation challenge,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 6182–6195

work page 2025

[46] [47]

Repurposing diffusion-based image generators for monocular depth estimation,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion-based image generators for monocular depth estimation,” inCVPR, 2024

work page 2024

[47] [48]

Va-depthnet: A variational approach to single image depth prediction,

C. Liu, S. Kumar, S. Gu, R. Timofte, and L. Van Gool, “Va-depthnet: A variational approach to single image depth prediction,”arXiv preprint arXiv:2302.06556, 2023

work page arXiv 2023

[48] [49]

3d packing for self-supervised monocular depth estimation,

V . Guizilini, R. Ambrus, S. Pillai, A. Raventos, and A. Gaidon, “3d packing for self-supervised monocular depth estimation,” inCVPR, 2020

work page 2020

[49] [50]

Dcdepth: Progressive monocular depth estimation in discrete cosine domain,

K. Wang, Z. Yan, J. Fan, W. Zhu, X. Li, J. Li, and J. Yang, “Dcdepth: Progressive monocular depth estimation in discrete cosine domain,” NeurIPS, 2024

work page 2024

[50] [51]

Iebins: Iterative elastic bins for monocular depth estimation,

S. Shao, Z. Pei, X. Wu, Z. Liu, W. Chen, and Z. Li, “Iebins: Iterative elastic bins for monocular depth estimation,”NeurIPS, 2023

work page 2023

[51] [52]

Depth Pro: Sharp Monocular Metric Depth in Less Than a Second

A. Bochkovskii, A. Delaunoy, H. Germain, M. Santos, Y . Zhou, S. R. Richter, and V . Koltun, “Depth pro: Sharp monocular metric depth in less than a second,” inInternational Conference on Learning Representations, 2025. [Online]. Available: https: //arxiv.org/abs/2410.02073

work page internal anchor Pith review Pith/arXiv arXiv 2025

[52] [53]

UniK3D: Universal camera monocular 3d estimation,

L. Piccinelli, C. Sakaridis, M. Segu, Y .-H. Yang, S. Li, W. Abbeloos, and L. Van Gool, “UniK3D: Universal camera monocular 3d estimation,” inCVPR, 2025

work page 2025

[53] [54]

UniDepthV2: Universal Monocular Metric Depth Estimation Made Simpler

L. Piccinelli, C. Sakaridis, Y .-H. Yang, M. Segu, S. Li, W. Abbeloos, and L. V . Gool, “UniDepthV2: Universal monocular metric depth estimation made simpler,” 2025. [Online]. Available: https: //arxiv.org/abs/2502.20110

work page internal anchor Pith review Pith/arXiv arXiv 2025

[54] [55]

Swin transformer: Hierarchical vision transformer using shifted windows,

Z. Liu, Y . Lin, Y . Cao, H. Hu, Y . Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” inICCV, 2021

work page 2021