Learning Image-Adaptive Scale Fields for Metric Depth Recovery
Recognition: 2 theorem links · Lean Theorem
Pith reviewed 2026-05-11 01:53 UTC · model grok-4.3
The pith
Monocular depth estimates are converted to accurate metric depths by fitting a low-dimensional image-adaptive scale field to sparse anchors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We address this problem by formulating metric depth recovery as image-adaptive scale field modeling. Instead of directly correcting the depth, we reformulate the correction as a low-dimensional linear combination of image-adaptive basis maps. These maps are derived from semantic and geometric cues encoded in the MDE estimations and intermediate representations. The weights of basis maps are efficiently determined from sparse metric anchors via a least-squares problem. This formulation yields improved metric depth accuracy, strong robustness under extreme anchor sparsity, and an interpretable decomposition of spatial scale variations.
What carries the argument
The image-adaptive scale field, expressed as a low-dimensional linear combination of basis maps derived from the monocular depth network's semantic and geometric cues, whose weights are solved via least-squares from sparse metric anchors.
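The mechanics fit in a few lines. Below is a minimal sketch, assuming the log-scale formulation quoted in the Lean section later on this page (ℓ(x) = E(x)ᵀw, D̂(x) = D_MDE(x) e^{ℓ(x)}, ridge solution w* = (MᵀM + λI)⁻¹ Mᵀy); the basis maps used here (a constant map plus the normalized relative depth) are illustrative stand-ins, not the paper's cue-derived construction.

```python
import numpy as np

def recover_metric_depth(d_mde, basis_maps, anchor_rc, anchor_depth, lam=1e-3):
    """Fit an image-adaptive log-scale field to sparse metric anchors.

    d_mde        : (H, W) relative depth from a monocular depth estimator
    basis_maps   : (M, H, W) image-adaptive basis maps E_m(x)
    anchor_rc    : (K, 2) integer pixel coordinates of the metric anchors
    anchor_depth : (K,) metric depths measured at those anchors
    lam          : ridge regularizer; matters when K is tiny (extreme sparsity)
    """
    rows, cols = anchor_rc[:, 0], anchor_rc[:, 1]
    # Design matrix: basis values sampled at the anchor pixels, one row per anchor.
    M = basis_maps[:, rows, cols].T                       # (K, M)
    # Targets: per-anchor log-scale, since D_hat(x) = D_mde(x) * exp(l(x)).
    y = np.log(anchor_depth) - np.log(d_mde[rows, cols])  # (K,)
    # Ridge least squares: w* = (M^T M + lam I)^{-1} M^T y.
    w = np.linalg.solve(M.T @ M + lam * np.eye(M.shape[1]), M.T @ y)
    # Dense log-scale field l(x) = sum_m w_m E_m(x), applied everywhere.
    log_scale = np.tensordot(w, basis_maps, axes=1)       # (H, W)
    return d_mde * np.exp(log_scale)

# Toy usage: two basis maps, three anchors whose true scale is a constant 2x.
H, W = 4, 6
rng = np.random.default_rng(0)
d_mde = rng.random((H, W)) + 0.5
basis = np.stack([np.ones((H, W)), (d_mde - d_mde.mean()) / d_mde.std()])
anchors = np.array([[0, 1], [2, 3], [3, 5]])
metric = 2.0 * d_mde[anchors[:, 0], anchors[:, 1]]
print(recover_metric_depth(d_mde, basis, anchors, metric))
```

The fit touches only a K×M linear system, which is why the least-squares step stays cheap regardless of image resolution.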
If this is right
- Metric depth accuracy improves over direct correction approaches when sparse anchors are present.
- The method remains effective even when the number of available metric anchors drops to extreme sparsity levels.
- The scale variations across an image can be decomposed into interpretable components tied to semantic and geometric regions.
- The approach applies to a wide range of existing monocular depth estimation models without requiring their retraining.
- Results hold consistently across multiple standard depth datasets.
Where Pith is reading between the lines
- The decomposition into basis maps could reveal which image regions carry the highest scale uncertainty in the original monocular estimate.
- Jointly optimizing the depth network features together with the scale-field weights might reduce residual errors further.
- Applying the same basis-map construction to video would likely produce temporally consistent metric depths by propagating the scale field across frames.
Load-bearing premise
The semantic and geometric cues encoded in the MDE estimations and intermediate representations are sufficient to generate basis maps that capture the true spatial scale variations across the image.
What would settle it
A dataset of images with dense ground-truth metric depth maps and deliberately placed sparse anchors, where the method's recovered depths show large errors relative to ground truth in regions whose scale variations are not aligned with the network's extracted cues.
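A sketch of that settling experiment, under stated assumptions: fit_scale_field stands in for any recovery method (for instance the function sketched above, with its basis maps baked in), and region_masks is assumed to come from a segmentation of the scene into regions whose alignment with the network's cues is known. The protocol, not the numbers, is the point.

```python
import numpy as np

def settle_test(d_mde, d_gt, region_masks, fit_scale_field, k_anchors=5, seed=0):
    """Deliberately place k sparse anchors on dense ground truth, recover
    metric depth, and report AbsRel per region. Large errors concentrated in
    regions misaligned with the extracted cues would count against the
    load-bearing premise. region_masks maps region names to boolean masks."""
    rng = np.random.default_rng(seed)
    valid = np.argwhere(d_gt > 0)                      # pixels with usable GT
    picks = rng.choice(len(valid), size=k_anchors, replace=False)
    anchor_rc = valid[picks]
    anchor_depth = d_gt[anchor_rc[:, 0], anchor_rc[:, 1]]
    d_hat = fit_scale_field(d_mde, anchor_rc, anchor_depth)
    report = {}
    for name, mask in region_masks.items():
        m = mask & (d_gt > 0)                          # valid GT pixels in this region
        report[name] = float(np.mean(np.abs(d_hat[m] - d_gt[m]) / d_gt[m]))
    return report                                      # per-region AbsRel
```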
read the original abstract
Monocular depth estimation (MDE) typically produces depth estimations that are defined up to an unknown scale or shift. When only sparse metric anchors are available, recovering accurate metric depth becomes challenging yet necessary for practical applications. We address this problem by formulating metric depth recovery as image-adaptive scale field modeling. Instead of directly correcting the depth, we reformulate the correction as a low-dimensional linear combination of image-adaptive basis maps. These maps are derived from semantic and geometric cues encoded in the MDE estimations and intermediate representations. The weights of basis maps are efficiently determined from sparse metric anchors via a least-squares problem. This formulation yields improved metric depth accuracy, strong robustness under extreme anchor sparsity, and an interpretable decomposition of spatial scale variations. Extensive experiments across multiple datasets and representative MDE models demonstrate the effectiveness and general applicability of our approach.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper formulates metric depth recovery from monocular depth estimation (MDE) as image-adaptive scale field modeling: the relative depth is multiplied by a scale field expressed through a low-dimensional linear combination of basis maps (derived from semantic and geometric cues in MDE outputs and intermediate features), with the combination weights solved via least-squares on sparse metric anchors. The approach is claimed to deliver improved accuracy, robustness to extreme anchor sparsity, and an interpretable decomposition, with supporting experiments across datasets and MDE models.
Significance. If the basis maps adequately span true spatial scale variations, the method supplies an efficient, low-parameter, and interpretable alternative to direct regression or global optimization for scale correction. The least-squares step is computationally lightweight and the decomposition into basis maps aids analysis of scale variations. These strengths would make the technique broadly applicable for practical MDE deployment when only sparse anchors are available.
major comments (2)
- [§3] Scale-field formulation, around the definition D̂(x) = D_MDE(x) e^{ℓ(x)} with ℓ(x) = Σₘ wₘ Eₘ(x): the central claim that the MDE-derived basis maps linearly approximate the ground-truth log-scale field log(D_GT(x) / D_MDE(x)) across the full image is load-bearing for both the accuracy and the 'strong robustness under extreme anchor sparsity' assertions. No derivation, completeness argument, or bound is supplied showing that the chosen semantic and geometric cues span the required function space, leaving open the risk that systematic MDE biases produce misaligned basis maps in anchor-free regions.
- [§4] Experiments and tables reporting accuracy under varying anchor counts: the quantitative improvements must be accompanied by ablations that isolate the contribution of the adaptive basis maps from that of a plain least-squares fit on the same anchors; without them, it is unclear whether the reported gains in the 1- and 5-anchor regimes stem from the image-adaptive formulation or simply from the least-squares solver itself.
minor comments (2)
- [Abstract] The performance claims would be more informative if the abstract briefly referenced the key error metrics (e.g., AbsRel, RMSE) and the range of anchor densities tested.
- [Figures] Figure captions and notation: ensure that the symbols for relative depth, metric depth, and the basis maps are defined consistently on first use and that all figure panels are referenced in the text.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, providing clarifications on our modeling assumptions and outlining the revisions we will make to improve the theoretical discussion and experimental validation.
read point-by-point responses
-
Referee: [§3] Scale-field formulation, around the definition D̂(x) = D_MDE(x) e^{ℓ(x)} with ℓ(x) = Σₘ wₘ Eₘ(x): the central claim that the MDE-derived basis maps linearly approximate the ground-truth log-scale field log(D_GT(x) / D_MDE(x)) across the full image is load-bearing for both the accuracy and the 'strong robustness under extreme anchor sparsity' assertions. No derivation, completeness argument, or bound is supplied showing that the chosen semantic and geometric cues span the required function space, leaving open the risk that systematic MDE biases produce misaligned basis maps in anchor-free regions.
Authors: We acknowledge that the linear approximation of the scale field via MDE-derived basis maps is a key modeling choice without a formal completeness proof or error bound. The basis maps are constructed from a diverse set of semantic and geometric cues extracted from MDE outputs and intermediate features, which empirically capture the dominant spatial modes of scale variation in real-world scenes. While we do not claim that these cues form a complete basis for all possible scale fields, the data-driven least-squares fitting on anchors allows the model to adapt weights effectively, and our experiments show consistent performance even under extreme sparsity. To strengthen the presentation, we will add a paragraph in Section 3 discussing the rationale for this approximation, its assumptions, and potential limitations in anchor-free regions. revision: partial
-
Referee: [§4] Experiments and tables reporting accuracy under varying anchor counts: the quantitative improvements must be accompanied by ablations that isolate the contribution of the adaptive basis maps from that of a plain least-squares fit on the same anchors; without them, it is unclear whether the reported gains in the 1- and 5-anchor regimes stem from the image-adaptive formulation or simply from the least-squares solver itself.
Authors: We agree that an ablation isolating the adaptive basis maps is necessary to clarify the source of the reported gains. A plain least-squares fit without the basis maps reduces to estimating a single global scale (a degenerate case of our model using only a constant basis), whereas our formulation enables spatially varying corrections through the linear combination of multiple image-adaptive maps. In the revised manuscript, we will add ablation studies comparing the full model against global scale correction and reduced-basis variants, all using identical anchor sets across the 1- and 5-anchor regimes. These results will be included in Section 4 to quantify the benefit of the adaptive component. revision: yes
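The first exchange above (basis completeness) can be made precise without the paper's help. A sketch under standard linear-model assumptions, not the authors' own derivation: the recovery error splits into an approximation term that no number of anchors can remove, and an estimation term driven by anchor sparsity.

```latex
% True log-scale field the basis must represent:
%   \ell^\star(x) = \log \bigl( D_{\mathrm{GT}}(x) / D_{\mathrm{MDE}}(x) \bigr)
% With \Pi_{\mathcal{E}} the projection onto \mathrm{span}\{E_1,\dots,E_M\}
% and \hat{\ell} the ridge fit from the K anchors, the error decomposes as
\[
  \ell^\star - \hat{\ell}
  = \underbrace{\ell^\star - \Pi_{\mathcal{E}}\,\ell^\star}_{\text{approximation (basis completeness)}}
  + \underbrace{\Pi_{\mathcal{E}}\,\ell^\star - \hat{\ell}}_{\text{estimation (anchor sparsity)}}
\]
% The referee's objection targets the first term: nothing in the paper
% bounds it when systematic MDE biases push \ell^\star outside the span.
```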
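The second exchange turns on the degenerate case the authors invoke, which is easy to exhibit: with a single constant basis map E₀(x) = 1, the ridge fit collapses to one global log-scale, exactly the plain least-squares baseline the referee asks for. A minimal sketch reusing the shapes from the earlier example (the baseline itself, not the paper's ablation):

```python
import numpy as np

def global_scale_baseline(d_mde, anchor_rc, anchor_depth, lam=1e-3):
    """Constant-basis degenerate case: l(x) = w0 everywhere, so the fit
    reduces to one shared scale s = exp(w0) applied to the whole image."""
    rows, cols = anchor_rc[:, 0], anchor_rc[:, 1]
    y = np.log(anchor_depth) - np.log(d_mde[rows, cols])
    # Ridge solution of the K x 1 system with a column of ones:
    # w0 = sum(y) / (K + lam).
    w0 = y.sum() / (len(y) + lam)
    return d_mde * np.exp(w0)
```

The ablation the referee wants would run this baseline and the full multi-basis fit on identical anchor sets and compare the error metrics across the 1- and 5-anchor regimes.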
Circularity Check
No significant circularity; derivation uses standard fitting on independent basis maps
full rationale
The derivation chain is: MDE produces relative depth plus semantic/geometric features; basis maps are generated from those features; weights are solved by least-squares on sparse metric anchors; final metric depth is the relative depth multiplied by the fitted scale field. This is a standard linear-model fit in which the scale field is explicitly constructed as a low-dimensional expansion. The fit matches the anchors by design at those points, but the method's purpose is to extrapolate the scale field to the full image using the basis maps, which are not defined in terms of the output depth or the fitted weights. No step renames a fitted quantity as an independent prediction, invokes a self-citation uniqueness theorem, or smuggles in an ansatz. Accuracy and robustness claims are presented as outcomes of experiments across datasets, not as logical consequences of the equations themselves, and the formulation is validated against external benchmarks rather than against its own outputs.
Axiom & Free-Parameter Ledger
invented entities (1)
- image-adaptive basis maps · no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Quoted passage: "We parameterize the scale field s(x) ... ℓ(x) = E(x)ᵀw = Σₘ wₘ Eₘ(x) ... D̂(x) = D_MDE(x) e^{ℓ(x)} ... w* = (MᵀM + λI)⁻¹ Mᵀy"
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · absolute_floor_iff_bare_distinguishability · unclear
unclear: Relation between the paper passage and the cited Recognition theorem.
Quoted passage: "Basis maps derived from semantic and geometric cues encoded in the MDE estimations and intermediate representations"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 12159–12168, 2021. · work page
- [2] Maxime Oquab, Timothée Darcet, Theo Moutakanni, Huy V. Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Russell Howes, Po-Yao Huang, Hu Xu, Vasu Sharma, Shang-Wen Li, Wojciech Galuba, Mike Rabbat, Mido Assran, Nicolas Ballas, Gabriel Synnaeve, Ishan Misra, Herve Jegou, Julien Mairal, Patrick Laba..., 2023. · work page
- [3] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. · work page
- [4] Haotong Lin, Sili Chen, Junhao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views, 2025. · work page
- [5] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth, 2023. · work page
- [6] Wei Yin, Chi Zhang, Hao Chen, Zhipeng Cai, Gang Yu, Kaixuan Wang, Xiaozhi Chen, and Chunhua Shen. Metric3D: Towards zero-shot metric 3D prediction from a single image. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9043–9053, October 2023. · work page
- [7] Aleksei Bochkovskii, Amaël Delaunoy, Hugo Germain, Marcel Santos, Yichao Zhou, Stephan R. Richter, and Vladlen Koltun. Depth Pro: Sharp monocular metric depth in less than a second. In International Conference on Learning Representations, 2025. · work page
- [8] René Ranftl, Katrin Lasinger, David Hafner, Konrad Schindler, and Vladlen Koltun. Towards robust monocular depth estimation: Mixing datasets for zero-shot cross-dataset transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence, 44(3):1623–1637, 2022. · work page
- [9] Jiuling Zhang, Yurong Wu, and Huilong Jiang. Survey on monocular metric depth estimation. Computers, 14(11), 2025. · work page
- [10] Rémi Marsal, Alexandre Chapoutot, Philippe Xu, and David Filliat. A simple yet effective test-time adaptation for zero-shot monocular metric depth estimation, 2025. · work page
- [11] Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2nd edition, 2004. · work page
- [12] S. Umeyama. Least-squares estimation of transformation parameters between two point patterns. IEEE Transactions on Pattern Analysis and Machine Intelligence, 13(4):376–380, 1991. · work page
- [13] Muhammad Aamir, Matthew Wijers, Andrew Loveridge, and Andrew Markham. A robust metric distance and height estimation pipeline for wildlife camera trap imagery. Ecological Informatics, 92:103520, 2025. · work page
- [14] Guokai Xu and Feng Zhao. Toward 3D scene reconstruction from locally scale-aligned monocular video depth. Journal of University of Science and Technology of China, 54(4):0402, 2024. · work page
- [15] Wei Zhang, Qing Cheng, David Skuddis, Niclas Zeller, Daniel Cremers, and Norbert Haala. HI-SLAM2: Geometry-aware Gaussian SLAM for fast monocular scene reconstruction. IEEE Transactions on Robotics, 41:6478–6493, 2025. · work page
- [16] Rizhao Fan, Tianfang Ma, Zhigen Li, Ning An, and Jian Cheng. Region-aware depth scale adaptation with sparse measurements, 2025. · work page
- [17] David Eigen, Christian Puhrsch, and Rob Fergus. Depth map prediction from a single image using a multi-scale deep network. In Neural Information Processing Systems, 2014. · work page
- [18] Jitesh Jain, Jiachen Li, Mang Tik Chiu, Ali Hassani, Nikita Orlov, and Humphrey Shi. OneFormer: One transformer to rule universal image segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2989–2998, 2023. · work page
- [19] Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C. Berg, Wan-Yen Lo, Piotr Dollar, and Ross Girshick. Segment Anything. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 4015–4026, October 2023. · work page
- [20] Trevor Hastie, Robert Tibshirani, and Jerome Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer, 2nd edition, 2009. · work page
- [21] Shuran Song, Samuel P. Lichtenberg, and Jianxiong Xiao. SUN RGB-D: A RGB-D scene understanding benchmark suite. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 567–576, June 2015. · work page
- [22] Christopher M. Bishop. Pattern Recognition and Machine Learning. Information Science and Statistics. Springer, New York, 2006. · work page
- [23] Ross Girshick. Fast R-CNN. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 1440–1448, 2015. · work page
- [24] Andreas Geiger, Philip Lenz, Christoph Stiller, and Raquel Urtasun. Vision meets robotics: The KITTI dataset. International Journal of Robotics Research, 32(11):1231–1237, 2013. · work page
- [25] Jose L. Gómez, Manuel Silva, Antonio Seoane, Agnès Borràs, Mario Noriega, German Ros, Jose A. Iglesias-Guitian, and Antonio M. López. All for one, and one for all: UrbanSyn dataset, the third musketeer of synthetic driving scenes. Neurocomputing, 637:130038, 2025. · work page
- [26] Shariq Farooq Bhat, Ibraheem Alhashim, and Peter Wonka. AdaBins: Depth estimation using adaptive bins. In 2021 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4008–4017, 2021. · work page
- [27] Clément Godard, Oisin Mac Aodha, and Gabriel J. Brostow. Unsupervised monocular depth estimation with left-right consistency. In 2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6602–6611, 2017. · work page
- [28] Clément Godard, Oisin Mac Aodha, Michael Firman, and Gabriel J. Brostow. Digging into self-supervised monocular depth estimation. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 3828–3838, 2019. · work page
- [29] Huan Fu, Mingming Gong, Chaohui Wang, Kayhan Batmanghelich, and Dacheng Tao. Deep ordinal regression network for monocular depth estimation. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. · work page
- [30] Iro Laina, Christian Rupprecht, Vasileios Belagiannis, Federico Tombari, and Nassir Navab. Deeper depth prediction with fully convolutional residual networks. In 2016 Fourth International Conference on 3D Vision (3DV), pages 239–248. IEEE, 2016. · work page
- [31] Martin A. Fischler and Robert C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981. · work page