Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements

Eleni Kelasidi; Marco Job; Michael Pantic; Roland Siegwart; Thomas Stastny

arxiv: 2602.03209 · v2 · pith:UBKXDNHRnew · submitted 2026-02-03 · 💻 cs.RO

Depth Completion in Unseen Field Robotics Environments Using Extremely Sparse Depth Measurements

Marco Job , Thomas Stastny , Eleni Kelasidi , Roland Siegwart , Michael Pantic This is my paper

Pith reviewed 2026-05-21 14:24 UTC · model grok-4.3

classification 💻 cs.RO

keywords depth completionfield roboticssynthetic data generationsparse depth measurementsmonocular depth estimationembedded deploymentunseen environmentsrobot perception

0 comments

The pith

Depth completion model trained on synthetic field data generalizes to real unseen environments using extremely sparse measurements.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a neural network for filling in dense depth maps can be trained exclusively on synthetic data created from 3D meshes of field scenes and then applied directly to real robot operations in new locations. This approach relies on only a handful of depth measurements from a sensor to produce full metric depth images. Field robots often face novel environments where gathering real training data is difficult, so this method could make perception systems more practical and cost-effective. The network runs fast enough on embedded computers to support real-time navigation decisions.

Core claim

The authors claim that their depth completion model, trained on synthetic datasets generated through Structure from Motion textured meshes and photorealistic novel viewpoint synthesis, can predict dense metric depth in previously unseen field robotics environments from extremely sparse depth inputs, achieving an end-to-end latency of 53 ms per frame on a Nvidia Jetson AGX Orin and competitive performance in real-world tests.

What carries the argument

The synthetic dataset generation pipeline that uses textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to create training data for generalization to real scenes.

If this is right

Real-time depth completion at 53 ms latency enables deployment on resource-constrained embedded platforms.
Competitive performance is demonstrated across diverse real-world field robotics scenarios without fine-tuning.
Low-cost cameras combined with sparse depth sensors can provide reliable metric depth perception in unstructured environments.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending this synthetic data approach could allow similar generalization for other sensor fusion tasks in robotics.
Field robots might operate in a wider variety of locations with reduced data collection requirements.
Future work could test the limits of how sparse the input measurements can be while maintaining accuracy.

Load-bearing premise

That the synthetic training data generated from Structure from Motion meshes and photorealistic rendering matches real field environments closely enough for the model to generalize without any domain adaptation or fine-tuning.

What would settle it

A significant drop in depth prediction accuracy when testing the model on real data from a field environment whose visual characteristics differ substantially from those in the synthetic training datasets would indicate that the generalization does not hold.

Figures

Figures reproduced from arXiv: 2602.03209 by Eleni Kelasidi, Marco Job, Michael Pantic, Roland Siegwart, Thomas Stastny.

**Figure 2.** Figure 2: From an aerial image sequence, using SfM, we obtain a textured 3D mesh of the area. With randomly sampled [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: From top to bottom row, samples of the Mountain [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗

**Figure 4.** Figure 4: At each training step, random corners are sampled [PITH_FULL_IMAGE:figures/full_fig_p004_4.png] view at source ↗

**Figure 5.** Figure 5: Latency, defined as the total time required to process [PITH_FULL_IMAGE:figures/full_fig_p006_5.png] view at source ↗

read the original abstract

Autonomous field robots operating in unstructured environments require robust perception to ensure safe and reliable operations. Recent advances in monocular depth estimation have demonstrated the potential of low-cost cameras as depth sensors; however, their adoption in field robotics remains limited due to the absence of reliable scale cues, ambiguous or low-texture conditions, and the scarcity of large-scale datasets. To address these challenges, we propose a depth completion model that trains on synthetic data and uses extremely sparse measurements from depth sensors to predict dense metric depth in unseen field robotics environments. A synthetic dataset generation pipeline tailored to field robotics enables the creation of multiple realistic datasets for training purposes. This dataset generation approach utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis to simulate diverse field robotics scenarios. Our approach achieves an end-to-end latency of 53 ms per frame on a Nvidia Jetson AGX Orin, enabling real-time deployment on embedded platforms. Extensive evaluation demonstrates competitive performance across diverse real-world field robotics scenarios.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Depth completion for field robots via synthetic SfM data and sparse measurements looks practical but lacks visible quantitative backing for the generalization claims.

read the letter

The main takeaway is a depth completion approach for field robots that combines monocular images with extremely sparse depth measurements, trained on synthetic data from SfM meshes and novel view synthesis. They highlight a 53 ms latency on Jetson hardware for real-time use in unseen environments. What the paper does is adapt existing depth completion techniques to the constraints of outdoor robotics. The synthetic pipeline is a practical solution to the lack of large labeled datasets in unstructured settings. Using textured 3D meshes from Structure from Motion allows generating diverse training scenes with correct metric scale, which is a strength for this domain. The latency figure is useful and shows attention to embedded deployment. If the evaluations hold up, this could influence how teams choose sensors for agricultural or exploration robots. The soft spots are in the evidence presented. The abstract claims competitive performance and extensive evaluation but provides no specific metrics, error bars, or baseline results. This makes it difficult to assess how well the model generalizes from synthetic to real data. The assumption that photorealistic renders from SfM meshes match real field conditions closely enough—without domain adaptation—is central but unverified in the visible text, with no domain gap metrics like histogram comparisons or cross-domain tests. The stress-test concern about this closeness appears valid based on what's shown. This work is for researchers and engineers building perception systems for autonomous field robots. A reader focused on practical depth estimation in variable outdoor conditions would get value from the data generation method and the sparse input handling. It deserves a serious referee because the application addresses a real bottleneck and the approach is grounded, even if the results need fuller scrutiny. I would recommend sending it for peer review so the quantitative details can be examined.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes a depth completion model for field robotics that trains exclusively on synthetic data generated from SfM-derived textured 3D meshes using photorealistic rendering and novel viewpoint synthesis. The model ingests extremely sparse depth measurements to output dense metric depth in unseen real environments without domain adaptation or fine-tuning. It reports an end-to-end latency of 53 ms per frame on an Nvidia Jetson AGX Orin and claims competitive performance across diverse real-world field robotics scenarios.

Significance. If the empirical claims are substantiated with quantitative results, the work would be significant for enabling real-time, low-cost perception in unstructured field environments. It reduces dependence on dense depth sensors and large real-world datasets by leveraging a tailored synthetic generation pipeline, with potential for embedded deployment.

major comments (3)

[Abstract] Abstract: The claims of 'competitive performance' and 'extensive evaluation' are unsupported by any quantitative metrics, baselines, error bars, or dataset statistics. Without these, the central generalization claim to unseen real field environments cannot be assessed.
[Evaluation] Evaluation section: No quantitative domain-gap metrics (FID, depth histogram divergence, or cross-domain ablation studies) are reported to validate that the SfM-rendered synthetic distribution is sufficiently close to real unseen field data (variable lighting, vegetation, sensor noise) for generalization without fine-tuning.
[Method] Method / Experiments: The sparsity level of depth measurements is identified as a free parameter, yet no specific values, sensitivity analysis, or relation to the 53 ms latency and accuracy figures are provided, undermining reproducibility of the real-time claim.

minor comments (2)

[Abstract] Abstract: The 53 ms latency figure lacks accompanying details on network architecture, input resolution, or exact sparsity pattern used during inference.
[Abstract] The term 'extremely sparse' is used without a precise definition (e.g., points per frame or percentage of pixels) in the abstract or early sections.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive feedback. We address each major comment below and have revised the manuscript to strengthen the presentation of our results and methods.

read point-by-point responses

Referee: [Abstract] Abstract: The claims of 'competitive performance' and 'extensive evaluation' are unsupported by any quantitative metrics, baselines, error bars, or dataset statistics. Without these, the central generalization claim to unseen real field environments cannot be assessed.

Authors: We agree that the abstract would benefit from explicit quantitative support. In the revised manuscript we have updated the abstract to report key metrics (RMSE and MAE on real field datasets), baseline comparisons, and dataset statistics. The evaluation section already contains the full quantitative results with error bars; these are now referenced in the abstract to better substantiate the generalization claims. revision: yes
Referee: [Evaluation] Evaluation section: No quantitative domain-gap metrics (FID, depth histogram divergence, or cross-domain ablation studies) are reported to validate that the SfM-rendered synthetic distribution is sufficiently close to real unseen field data (variable lighting, vegetation, sensor noise) for generalization without fine-tuning.

Authors: We acknowledge the value of explicit domain-gap quantification. We have added a new subsection to the evaluation that reports FID scores between rendered synthetic images and real field images, depth histogram divergence statistics, and a cross-domain ablation comparing performance with and without synthetic-to-real adaptation. These additions directly address the concern about distribution closeness for zero-shot generalization. revision: yes
Referee: [Method] Method / Experiments: The sparsity level of depth measurements is identified as a free parameter, yet no specific values, sensitivity analysis, or relation to the 53 ms latency and accuracy figures are provided, undermining reproducibility of the real-time claim.

Authors: We agree that specific sparsity values and analysis are needed for reproducibility. The revised manuscript now states the exact sparsity levels used in all experiments (0.05 %–2 % of pixels) and includes a sensitivity plot showing accuracy and latency as functions of sparsity. The reported 53 ms latency corresponds to the 0.5 % sparsity operating point on the Jetson AGX Orin; this relationship is now explicitly stated. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical training and real-world evaluation

full rationale

The paper describes an empirical pipeline: synthetic data is generated via SfM meshes and photorealistic rendering, a neural network is trained on it, and performance is measured directly on real unseen field data with reported latency and accuracy metrics. No equations, derivations, or predictions are presented that reduce by construction to fitted inputs or self-citations; the central claims rest on external validation against real sensor measurements rather than internal redefinition or renaming of results.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The claim rests on the transferability of photorealistic synthetic scenes to real field data and on the assumption that extremely sparse depth measurements are both available and accurate enough to anchor scale.

free parameters (1)

sparsity level of depth measurements
The exact number and placement of sparse points is a design choice selected to represent minimal sensor input rather than fitted to a particular dataset.

axioms (1)

domain assumption Photorealistic rendering of SfM meshes produces training distributions close enough to real field environments for zero-shot generalization.
Invoked when the authors state that the synthetic pipeline enables training for unseen real scenarios.

pith-pipeline@v0.9.0 · 5714 in / 1436 out tokens · 73563 ms · 2026-05-21T14:24:29.265461+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We modify the convolutional layers of the pretrained encoder to accommodate a fourth input channel... The main loss function is the scale-invariant loss function proposed in [18]... L = L_si + λ_grad · L_grad.
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

A synthetic dataset generation pipeline... utilizes textured 3D meshes from Structure from Motion and photorealistic rendering with novel viewpoint synthesis.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · 1 internal anchor

[1]

Past, present, and future of simulta- neous localization and mapping: Toward the robust- perception age,

C. Cadena et al., “Past, present, and future of simulta- neous localization and mapping: Toward the robust- perception age,”IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016

work page 2016
[2]

Degrada- tion resilient lidar-radar-inertial odometry,

M. Nissov, N. Khedekar, and K. Alexis, “Degrada- tion resilient lidar-radar-inertial odometry,” inIEEE International Conference on Robotics and Automation, 2024, pp. 8587–8594

work page 2024
[3]

Repurposing diffusion- based image generators for monocular depth estima- tion,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion- based image generators for monocular depth estima- tion,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9492–9502

work page 2024
[4]

Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,

M. Hu et al., “Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 579–10 596, 2024

work page 2024
[5]

Depth anything v2,

L. Yang et al., “Depth anything v2,” inAdvances in Neural Information Processing Systems, vol. 37, 2024, pp. 21 875–21 911.DOI: 10.52202/079017-0688

work page doi:10.52202/079017-0688 2024
[6]

UniK3D: Universal camera monocular 3d estimation,

L. Piccinelli et al., “UniK3D: Universal camera monocular 3d estimation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025
[7]

Depth pro: Sharp monocu- lar metric depth in less than a second,

A. Bochkovskiy et al., “Depth pro: Sharp monocu- lar metric depth in less than a second,” inInterna- tional Conference on Representation Learning, 2025, pp. 75 602–75 637

work page 2025
[8]

Sparsity invariant cnns,

J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” inIn- ternational Conference on 3D Vision, 2017, pp. 11–20

work page 2017
[9]

Sparse-to-dense: Depth pre- diction from sparse depth samples and a single image,

F. Ma and S. Karaman, “Sparse-to-dense: Depth pre- diction from sparse depth samples and a single image,” inIEEE International Conference on Robotics and Automation, 2018, pp. 4796–4803

work page 2018
[10]

Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,

J. Qiu et al., “Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,” inIEEE Conference on Computer Vision and Pattern Recognition, 2019

work page 2019
[11]

Depth estima- tion from monocular images and sparse radar data,

J.-T. Lin, D. Dai, and L. V . Gool, “Depth estima- tion from monocular images and sparse radar data,” inIEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 10 233–10 240

work page 2020
[12]

Radar-camera pixel depth associa- tion for depth completion,

Y . Long, D. Morris, X. Liu, M. Castro, P. Chakravarty, and P. Narayanan, “Radar-camera pixel depth associa- tion for depth completion,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 502–12 511

work page 2021
[13]

Advancing self-supervised monocular depth learning with sparse lidar,

Z. Feng, L. Jing, P. Yin, Y . Tian, and B. Li, “Advancing self-supervised monocular depth learning with sparse lidar,” inConference on Robot Learning, PMLR, 2022, pp. 685–694

work page 2022
[14]

G2-monodepth: A general framework of generalized depth inference from monocular rgb+x data,

H. Wang, M. Yang, and N. Zheng, “G2-monodepth: A general framework of generalized depth inference from monocular rgb+x data,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 3753–3771, 2024

work page 2024
[15]

Marigold-dc: Zero-shot monoc- ular depth completion with guided diffusion,

M. Viola et al., “Marigold-dc: Zero-shot monoc- ular depth completion with guided diffusion,” in IEEE/CVF International Conference on Computer Vi- sion, 2025

work page 2025
[16]

Omni-dc: Highly robust depth completion with multiresolution depth integration,

Y . Zuo, W. Yang, Z. Ma, and J. Deng, “Omni-dc: Highly robust depth completion with multiresolution depth integration,” inIEEE/CVF International Con- ference on Computer Vision, 2025

work page 2025
[17]

Depth prompting for sensor-agnostic depth estimation,

J.-H. Park, C. Jeong, J. Lee, and H.-G. Jeon, “Depth prompting for sensor-agnostic depth estimation,” in IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2024, pp. 9859–9869

work page 2024
[18]

Depth map prediction from a single image using a multi-scale deep network,

D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” inAdvances in Neural Information Processing Systems, vol. 27, 2014

work page 2014
[19]

Deeper depth prediction with fully convolutional residual networks,

I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” inF ourth Interna- tional Conference on 3D Vision, 2016, pp. 239–248

work page 2016
[20]

Learning depth from single monocular images using deep convolu- tional neural fields,

F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolu- tional neural fields,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024–2039, 2016

work page 2024
[21]

Deep Ordinal Regression Network for Monoc- ular Depth Estimation,

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep Ordinal Regression Network for Monoc- ular Depth Estimation,” inIEEE Conference on Com- puter Vision and Pattern Recognition, 2018

work page 2018
[22]

Transformer-based attention networks for continu- ous pixel-wise prediction,

G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continu- ous pixel-wise prediction,” inIEEE/CVF International Conference on Computer Vision, 2021, pp. 16 249– 16 259

work page 2021
[23]

Adabins: Depth estimation using adaptive bins,

S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” inIEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2021, pp. 4009–4018

work page 2021
[24]

The surprising effectiveness of diffusion models for optical flow and monocular depth estimation,

S. Saxena et al., “The surprising effectiveness of diffusion models for optical flow and monocular depth estimation,” inInternational Conference on Neural Information Processing Systems, vol. 37, Curran As- sociates Inc., 2023

work page 2023
[25]

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,

R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1623–1637, 2022

work page 2022
[26]

Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,

R. Wang et al., “Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5261–5271

work page 2025
[27]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M ¨uller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,”CoRR, vol. abs/2302.12288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[28]

Towards zero-shot scale-aware monocular depth estimation,

V . Guizilini, I. Vasiljevic, D. Chen, R. Ambrus , , and A. Gaidon, “Towards zero-shot scale-aware monocular depth estimation,” inIEEE/CVF International Confer- ence on Computer Vision, 2023, pp. 9199–9209

work page 2023
[29]

Indoor segmentation and support inference from rgbd images,

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” inEuropean Conference on Computer Vision, 2012, pp. 746–760

work page 2012
[30]

Diode: A dense indoor and outdoor depth dataset,

I. Vasiljevic et al., “DIODE: A Dense Indoor and Outdoor DEpth Dataset,”CoRR, vol. abs/1908.00463, 2019

work page arXiv 1908
[31]

Megadepth: Learning single- view depth prediction from internet photos,

Z. Li and N. Snavely, “Megadepth: Learning single- view depth prediction from internet photos,” in IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2018, pp. 2041–2050

work page 2018
[32]

Mid-air: A multi-modal dataset for extremely low altitude drone flights,

M. Fonder and M. Van Droogenbroeck, “Mid-air: A multi-modal dataset for extremely low altitude drone flights,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 553– 562

work page 2019
[33]

Tartanair: A dataset to push the limits of visual slam,

W. Wang et al., “Tartanair: A dataset to push the limits of visual slam,” inIEEE/RSJ International Conference on Intelligent Robots and Systems, 2020

work page 2020
[34]

Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding,

M. Roberts et al., “Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding,” inIEEE/CVF International Conference on Computer Vision, 2021, pp. 10 892–10 902

work page 2021
[35]

Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,

Y . Yao et al., “Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,” inIEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, Jun. 2020

work page 2020
[36]

Enrich: Multi-purpose dataset for benchmarking in computer vision and photogrammetry,

D. Marelli, L. Morelli, E. M. Farella, S. Bianco, G. Ciocca, and F. Remondino, “Enrich: Multi-purpose dataset for benchmarking in computer vision and photogrammetry,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 198, pp. 84–98, 2023

work page 2023
[37]

Training data-efficient image transformers & distillation through attention,

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablay- rolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inInter- national Conference on Machine Learning, ser. Pro- ceedings of Machine Learning Research, vol. 139, PMLR, 2021, pp. 10 347–10 357

work page 2021
[38]

Good features to track,

J. Shi and Tomasi, “Good features to track,” inIEEE Conference on Computer Vision and Pattern Recogni- tion, 1994, pp. 593–600. [39]Ballast Water Tank Dataset, https://github.com/ntnu- arl/ballast water tank dataset, Accessed: 2025-08-18, Mar. 2024

work page 1994
[39]

Online refractive cam- era model calibration in visual inertial odometry,

M. Singh and K. Alexis, “Online refractive cam- era model calibration in visual inertial odometry,” inIEEE/RSJ International Conference on Intelligent Robots and Systems, 2024, pp. 12 609–12 616

work page 2024
[40]

Structure-from- motion revisited,

J. L. Sch ¨onberger and J.-M. Frahm, “Structure-from- motion revisited,” inConference on Computer Vision and Pattern Recognition, 2016

work page 2016
[41]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Conference on Learn- ing Representations, 2017

work page 2017
[42]

SGDR: Stochastic gra- dient descent with warm restarts,

I. Loshchilov and F. Hutter, “SGDR: Stochastic gra- dient descent with warm restarts,” inInternational Conference on Learning Representations, 2017

work page 2017
[43]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inIEEE/CVF In- ternational Conference on Computer Vision, 2021, pp. 12 179–12 188

work page 2021

[1] [1]

Past, present, and future of simulta- neous localization and mapping: Toward the robust- perception age,

C. Cadena et al., “Past, present, and future of simulta- neous localization and mapping: Toward the robust- perception age,”IEEE Transactions on Robotics, vol. 32, no. 6, pp. 1309–1332, 2016

work page 2016

[2] [2]

Degrada- tion resilient lidar-radar-inertial odometry,

M. Nissov, N. Khedekar, and K. Alexis, “Degrada- tion resilient lidar-radar-inertial odometry,” inIEEE International Conference on Robotics and Automation, 2024, pp. 8587–8594

work page 2024

[3] [3]

Repurposing diffusion- based image generators for monocular depth estima- tion,

B. Ke, A. Obukhov, S. Huang, N. Metzger, R. C. Daudt, and K. Schindler, “Repurposing diffusion- based image generators for monocular depth estima- tion,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2024, pp. 9492–9502

work page 2024

[4] [4]

Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,

M. Hu et al., “Metric3d v2: A versatile monocular geometric foundation model for zero-shot metric depth and surface normal estimation,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 12, pp. 10 579–10 596, 2024

work page 2024

[5] [5]

Depth anything v2,

L. Yang et al., “Depth anything v2,” inAdvances in Neural Information Processing Systems, vol. 37, 2024, pp. 21 875–21 911.DOI: 10.52202/079017-0688

work page doi:10.52202/079017-0688 2024

[6] [6]

UniK3D: Universal camera monocular 3d estimation,

L. Piccinelli et al., “UniK3D: Universal camera monocular 3d estimation,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025

work page 2025

[7] [7]

Depth pro: Sharp monocu- lar metric depth in less than a second,

A. Bochkovskiy et al., “Depth pro: Sharp monocu- lar metric depth in less than a second,” inInterna- tional Conference on Representation Learning, 2025, pp. 75 602–75 637

work page 2025

[8] [8]

Sparsity invariant cnns,

J. Uhrig, N. Schneider, L. Schneider, U. Franke, T. Brox, and A. Geiger, “Sparsity invariant cnns,” inIn- ternational Conference on 3D Vision, 2017, pp. 11–20

work page 2017

[9] [9]

Sparse-to-dense: Depth pre- diction from sparse depth samples and a single image,

F. Ma and S. Karaman, “Sparse-to-dense: Depth pre- diction from sparse depth samples and a single image,” inIEEE International Conference on Robotics and Automation, 2018, pp. 4796–4803

work page 2018

[10] [10]

Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,

J. Qiu et al., “Deeplidar: Deep surface normal guided depth prediction for outdoor scene from sparse lidar data and single color image,” inIEEE Conference on Computer Vision and Pattern Recognition, 2019

work page 2019

[11] [11]

Depth estima- tion from monocular images and sparse radar data,

J.-T. Lin, D. Dai, and L. V . Gool, “Depth estima- tion from monocular images and sparse radar data,” inIEEE/RSJ International Conference on Intelligent Robots and Systems, 2020, pp. 10 233–10 240

work page 2020

[12] [12]

Radar-camera pixel depth associa- tion for depth completion,

Y . Long, D. Morris, X. Liu, M. Castro, P. Chakravarty, and P. Narayanan, “Radar-camera pixel depth associa- tion for depth completion,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 12 502–12 511

work page 2021

[13] [13]

Advancing self-supervised monocular depth learning with sparse lidar,

Z. Feng, L. Jing, P. Yin, Y . Tian, and B. Li, “Advancing self-supervised monocular depth learning with sparse lidar,” inConference on Robot Learning, PMLR, 2022, pp. 685–694

work page 2022

[14] [14]

G2-monodepth: A general framework of generalized depth inference from monocular rgb+x data,

H. Wang, M. Yang, and N. Zheng, “G2-monodepth: A general framework of generalized depth inference from monocular rgb+x data,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 5, pp. 3753–3771, 2024

work page 2024

[15] [15]

Marigold-dc: Zero-shot monoc- ular depth completion with guided diffusion,

M. Viola et al., “Marigold-dc: Zero-shot monoc- ular depth completion with guided diffusion,” in IEEE/CVF International Conference on Computer Vi- sion, 2025

work page 2025

[16] [16]

Omni-dc: Highly robust depth completion with multiresolution depth integration,

Y . Zuo, W. Yang, Z. Ma, and J. Deng, “Omni-dc: Highly robust depth completion with multiresolution depth integration,” inIEEE/CVF International Con- ference on Computer Vision, 2025

work page 2025

[17] [17]

Depth prompting for sensor-agnostic depth estimation,

J.-H. Park, C. Jeong, J. Lee, and H.-G. Jeon, “Depth prompting for sensor-agnostic depth estimation,” in IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2024, pp. 9859–9869

work page 2024

[18] [18]

Depth map prediction from a single image using a multi-scale deep network,

D. Eigen, C. Puhrsch, and R. Fergus, “Depth map prediction from a single image using a multi-scale deep network,” inAdvances in Neural Information Processing Systems, vol. 27, 2014

work page 2014

[19] [19]

Deeper depth prediction with fully convolutional residual networks,

I. Laina, C. Rupprecht, V . Belagiannis, F. Tombari, and N. Navab, “Deeper depth prediction with fully convolutional residual networks,” inF ourth Interna- tional Conference on 3D Vision, 2016, pp. 239–248

work page 2016

[20] [20]

Learning depth from single monocular images using deep convolu- tional neural fields,

F. Liu, C. Shen, G. Lin, and I. Reid, “Learning depth from single monocular images using deep convolu- tional neural fields,”IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 38, no. 10, pp. 2024–2039, 2016

work page 2024

[21] [21]

Deep Ordinal Regression Network for Monoc- ular Depth Estimation,

H. Fu, M. Gong, C. Wang, K. Batmanghelich, and D. Tao, “Deep Ordinal Regression Network for Monoc- ular Depth Estimation,” inIEEE Conference on Com- puter Vision and Pattern Recognition, 2018

work page 2018

[22] [22]

Transformer-based attention networks for continu- ous pixel-wise prediction,

G. Yang, H. Tang, M. Ding, N. Sebe, and E. Ricci, “Transformer-based attention networks for continu- ous pixel-wise prediction,” inIEEE/CVF International Conference on Computer Vision, 2021, pp. 16 249– 16 259

work page 2021

[23] [23]

Adabins: Depth estimation using adaptive bins,

S. F. Bhat, I. Alhashim, and P. Wonka, “Adabins: Depth estimation using adaptive bins,” inIEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, 2021, pp. 4009–4018

work page 2021

[24] [24]

The surprising effectiveness of diffusion models for optical flow and monocular depth estimation,

S. Saxena et al., “The surprising effectiveness of diffusion models for optical flow and monocular depth estimation,” inInternational Conference on Neural Information Processing Systems, vol. 37, Curran As- sociates Inc., 2023

work page 2023

[25] [25]

Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,

R. Ranftl, K. Lasinger, D. Hafner, K. Schindler, and V . Koltun, “Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 44, no. 3, pp. 1623–1637, 2022

work page 2022

[26] [26]

Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,

R. Wang et al., “Moge: Unlocking accurate monocular geometry estimation for open-domain images with optimal training supervision,” inProceedings of the Computer Vision and Pattern Recognition Conference, 2025, pp. 5261–5271

work page 2025

[27] [27]

ZoeDepth: Zero-shot Transfer by Combining Relative and Metric Depth

S. F. Bhat, R. Birkl, D. Wofk, P. Wonka, and M. M ¨uller, “Zoedepth: Zero-shot transfer by combining relative and metric depth,”CoRR, vol. abs/2302.12288, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[28] [28]

Towards zero-shot scale-aware monocular depth estimation,

V . Guizilini, I. Vasiljevic, D. Chen, R. Ambrus , , and A. Gaidon, “Towards zero-shot scale-aware monocular depth estimation,” inIEEE/CVF International Confer- ence on Computer Vision, 2023, pp. 9199–9209

work page 2023

[29] [29]

Indoor segmentation and support inference from rgbd images,

N. Silberman, D. Hoiem, P. Kohli, and R. Fergus, “Indoor segmentation and support inference from rgbd images,” inEuropean Conference on Computer Vision, 2012, pp. 746–760

work page 2012

[30] [30]

Diode: A dense indoor and outdoor depth dataset,

I. Vasiljevic et al., “DIODE: A Dense Indoor and Outdoor DEpth Dataset,”CoRR, vol. abs/1908.00463, 2019

work page arXiv 1908

[31] [31]

Megadepth: Learning single- view depth prediction from internet photos,

Z. Li and N. Snavely, “Megadepth: Learning single- view depth prediction from internet photos,” in IEEE/CVF Conference on Computer Vision and Pat- tern Recognition, 2018, pp. 2041–2050

work page 2018

[32] [32]

Mid-air: A multi-modal dataset for extremely low altitude drone flights,

M. Fonder and M. Van Droogenbroeck, “Mid-air: A multi-modal dataset for extremely low altitude drone flights,” inIEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, 2019, pp. 553– 562

work page 2019

[33] [33]

Tartanair: A dataset to push the limits of visual slam,

W. Wang et al., “Tartanair: A dataset to push the limits of visual slam,” inIEEE/RSJ International Conference on Intelligent Robots and Systems, 2020

work page 2020

[34] [34]

Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding,

M. Roberts et al., “Hypersim: A photorealistic syn- thetic dataset for holistic indoor scene understanding,” inIEEE/CVF International Conference on Computer Vision, 2021, pp. 10 892–10 902

work page 2021

[35] [35]

Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,

Y . Yao et al., “Blendedmvs: A large-scale dataset for generalized multi-view stereo networks,” inIEEE/CVF Conference on Computer Vision and Pattern Recogni- tion, Jun. 2020

work page 2020

[36] [36]

Enrich: Multi-purpose dataset for benchmarking in computer vision and photogrammetry,

D. Marelli, L. Morelli, E. M. Farella, S. Bianco, G. Ciocca, and F. Remondino, “Enrich: Multi-purpose dataset for benchmarking in computer vision and photogrammetry,”ISPRS Journal of Photogrammetry and Remote Sensing, vol. 198, pp. 84–98, 2023

work page 2023

[37] [37]

Training data-efficient image transformers & distillation through attention,

H. Touvron, M. Cord, M. Douze, F. Massa, A. Sablay- rolles, and H. J ´egou, “Training data-efficient image transformers & distillation through attention,” inInter- national Conference on Machine Learning, ser. Pro- ceedings of Machine Learning Research, vol. 139, PMLR, 2021, pp. 10 347–10 357

work page 2021

[38] [38]

Good features to track,

J. Shi and Tomasi, “Good features to track,” inIEEE Conference on Computer Vision and Pattern Recogni- tion, 1994, pp. 593–600. [39]Ballast Water Tank Dataset, https://github.com/ntnu- arl/ballast water tank dataset, Accessed: 2025-08-18, Mar. 2024

work page 1994

[39] [39]

Online refractive cam- era model calibration in visual inertial odometry,

M. Singh and K. Alexis, “Online refractive cam- era model calibration in visual inertial odometry,” inIEEE/RSJ International Conference on Intelligent Robots and Systems, 2024, pp. 12 609–12 616

work page 2024

[40] [40]

Structure-from- motion revisited,

J. L. Sch ¨onberger and J.-M. Frahm, “Structure-from- motion revisited,” inConference on Computer Vision and Pattern Recognition, 2016

work page 2016

[41] [41]

Decoupled weight decay regularization,

I. Loshchilov and F. Hutter, “Decoupled weight decay regularization,” inInternational Conference on Learn- ing Representations, 2017

work page 2017

[42] [42]

SGDR: Stochastic gra- dient descent with warm restarts,

I. Loshchilov and F. Hutter, “SGDR: Stochastic gra- dient descent with warm restarts,” inInternational Conference on Learning Representations, 2017

work page 2017

[43] [43]

Vision transformers for dense prediction,

R. Ranftl, A. Bochkovskiy, and V . Koltun, “Vision transformers for dense prediction,” inIEEE/CVF In- ternational Conference on Computer Vision, 2021, pp. 12 179–12 188

work page 2021