Recognition: no theorem link
Sat3R: Satellite DSM Reconstruction via RPC-Aware Depth Fine-tuning
Pith reviewed 2026-05-11 01:21 UTC · model grok-4.3
The pith
A monocular depth model fine-tuned with RPC-geometry supervision matches the accuracy of optimization-based satellite DSM reconstruction at over 300 times the speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Sat3R constructs physically consistent pseudo-depth supervision from RPC geometry and uses it to fine-tune Depth Anything V2 with the Scale-Invariant Logarithmic (SiLog) loss. This RPC-aware metric depth fine-tuning adapts the model to the satellite domain, enabling feed-forward DSM reconstruction that reduces MAE by 38 percent over zero-shot baselines while achieving accuracy competitive with optimization-based methods at more than 300 times the speed on the DFC2019 benchmark.
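For concreteness, a minimal PyTorch sketch of the SiLog loss in its standard form (variance of log-depth residuals minus a weighted squared mean); the λ = 0.85 weight and the masking convention are common defaults assumed here, not values taken from the paper.

```python
import torch

def silog_loss(pred, target, lam=0.85, eps=1e-6):
    """Scale-Invariant Logarithmic (SiLog) loss on positive depth maps.

    pred, target: tensors of the same shape; invalid ground-truth
    pixels (<= 0) are masked out. lam balances scale-invariance
    against absolute accuracy; 0.85 is a common default, assumed here.
    """
    mask = target > eps
    d = torch.log(pred[mask] + eps) - torch.log(target[mask] + eps)
    return torch.sqrt((d ** 2).mean() - lam * d.mean() ** 2 + eps)
```

With lam = 1 the loss is fully scale-invariant (only relative depth matters); lam < 1 reintroduces a penalty on global scale error, which matters for metric DSM heights.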
What carries the argument
RPC-aware metric depth fine-tuning that adapts a monocular depth foundation model using physically consistent pseudo-depth supervision derived from Rational Polynomial Camera geometry.
Load-bearing premise
Pseudo-depth maps constructed from RPC geometry supply accurate, unbiased training signals sufficient to adapt the foundation model to satellite imagery without introducing systematic errors or requiring per-scene optimization.
What would settle it
If Sat3R produces no MAE reduction over zero-shot baselines, or falls clearly short of optimization-based accuracy, on the DFC2019 benchmark and on similar satellite test sets with varied RPC parameters, the claim that RPC-aware fine-tuning bridges the domain gap would be falsified.
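The check itself is simple once predicted and reference DSMs share a grid; a minimal sketch, assuming aligned rasters and a conventional no-data flag (the exact DFC2019 masking and registration protocol may differ):

```python
import numpy as np

def dsm_errors(pred_dsm, gt_dsm, nodata=-9999.0):
    """Mean and median absolute height error between a predicted and a
    reference DSM on a common grid, skipping no-data cells."""
    valid = (gt_dsm != nodata) & np.isfinite(pred_dsm) & np.isfinite(gt_dsm)
    abs_err = np.abs(pred_dsm[valid] - gt_dsm[valid])
    return abs_err.mean(), np.median(abs_err)
```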
Original abstract
Accurate Digital Surface Model (DSM) reconstruction from satellite imagery is critical for applications such as disaster response, urban planning, and large-scale geographic mapping. Existing approaches face a fundamental trade-off: optimization-based methods achieve strong accuracy but require hours of per-scene computation, while generalizable geometry foundation models offer near-instant inference but fail to generalize to satellite imagery due to the domain gap introduced by the Rational Polynomial Camera (RPC) model and mismatched depth scale distributions. We present Sat3R, a feed-forward framework that bridges this gap via RPC-aware metric depth fine-tuning of Depth Anything V2 using the Scale-Invariant Logarithmic (SiLog) loss. By constructing physically consistent pseudo depth supervision from RPC geometry, Sat3R adapts a monocular depth foundation model to the satellite domain without per-scene optimization. Experiments on the DFC2019 benchmark demonstrate that Sat3R reduces MAE by 38% over zero-shot feed-forward baselines and achieves competitive accuracy against optimization-based methods, while delivering over 300x speedup. Sat3R demonstrates that feed-forward models, when properly adapted to the satellite domain, can match optimization-based accuracy at a fraction of the computational cost, paving the way for practical large-scale satellite DSM reconstruction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces Sat3R, a feed-forward framework for satellite DSM reconstruction that adapts the Depth Anything V2 monocular depth foundation model via RPC-aware metric depth fine-tuning. It constructs physically consistent pseudo-depth supervision from Rational Polynomial Camera (RPC) geometry and optimizes with the Scale-Invariant Logarithmic (SiLog) loss, avoiding per-scene optimization. On the DFC2019 benchmark, Sat3R reports a 38% MAE reduction over zero-shot feed-forward baselines, competitive accuracy with optimization-based methods, and >300x speedup.
Significance. If the pseudo-depth supervision proves accurate and unbiased, the work would demonstrate that domain-adapted feed-forward models can close the accuracy gap with slow optimization-based DSM pipelines while retaining near-instant inference, enabling practical large-scale satellite mapping for disaster response and urban planning.
major comments (2)
- §3.2 (Pseudo-depth supervision construction): The central adaptation claim rests on RPC-derived pseudo-depths supplying accurate, unbiased training signals for fine-tuning. The manuscript states these labels are 'physically consistent' but supplies no quantitative validation (e.g., MAE, scale bias, or residual statistics of the pseudo-depths versus the DFC2019 ground-truth DSM on the fine-tuning scenes). RPC models are known approximations; without this check it is impossible to attribute the reported 38% MAE gain to the fine-tuning procedure rather than to the quality or bias of the supervision.
- §4.2 (Experiments and ablations): The results claim a 38% MAE reduction and competitive accuracy, yet the text provides no error bars, multiple-run statistics, or an ablation isolating the contribution of RPC-aware supervision versus standard fine-tuning. This makes it difficult to verify that the gains are robust and directly attributable to the proposed RPC-aware component.
minor comments (2)
- Abstract and §1: The term 'physically consistent' is used without a precise definition or reference to the RPC residual model; a short clarifying sentence would improve readability.
- §4.1: Table captions and axis labels in the quantitative comparison figures could more explicitly list the exact baselines (zero-shot vs. fine-tuned) to avoid reader confusion.
Simulated Author's Rebuttal
We thank the referee for the constructive comments, which highlight important aspects for strengthening the presentation of our work. We address each major comment below and will revise the manuscript to incorporate additional validation and analysis.
Point-by-point responses
-
Referee: §3.2 (Pseudo-depth supervision construction): The central adaptation claim rests on RPC-derived pseudo-depths supplying accurate, unbiased training signals for fine-tuning. The manuscript states these labels are 'physically consistent' but supplies no quantitative validation (e.g., MAE, scale bias, or residual statistics of the pseudo-depths versus the DFC2019 ground-truth DSM on the fine-tuning scenes). RPC models are known approximations; without this check it is impossible to attribute the reported 38% MAE gain to the fine-tuning procedure rather than to the quality or bias of the supervision.
Authors: We agree that explicit quantitative validation of the pseudo-depth labels was not included in the original manuscript. Although the pseudo-depths are constructed directly from RPC geometry and stereo pairs (ensuring consistency with the camera model by design), we acknowledge that reporting error metrics against ground-truth DSMs on the fine-tuning scenes would strengthen the claim. In the revised version, we will add a new paragraph and table in §3.2 with MAE, scale bias, and residual statistics of the pseudo-depths versus DFC2019 ground truth on the training scenes. This will allow readers to evaluate the supervision quality independently (a sketch of such a check follows these responses). revision: yes
-
Referee: §4.2 (Experiments and ablations): The results claim a 38% MAE reduction and competitive accuracy, yet the text provides no error bars, multiple-run statistics, or an ablation isolating the contribution of RPC-aware supervision versus standard fine-tuning. This makes it difficult to verify that the gains are robust and directly attributable to the proposed RPC-aware component.
Authors: We concur that the lack of statistical reporting and targeted ablations limits the ability to assess robustness and isolate the RPC-aware component. In the revision, we will add error bars (standard deviation across multiple training runs with different random seeds) to the main results table. We will also include a dedicated ablation subsection in §4.2 comparing (i) zero-shot baseline, (ii) standard fine-tuning without RPC awareness, and (iii) our full RPC-aware fine-tuning. This will directly demonstrate the contribution of the proposed supervision construction. revision: yes
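The supervision audit promised in the first response could take the form of the sketch below, which reports MAE, a robust multiplicative scale bias, and the additive bias/spread of pseudo-depth labels against ground truth; the array names and the assumption that pseudo-depths are resampled to the ground-truth grid are hypothetical, not taken from the paper.

```python
import numpy as np

def supervision_stats(pseudo_depth, gt_depth):
    """Residual statistics of pseudo-depth labels vs. ground truth:
    MAE, median multiplicative scale bias, and additive bias/spread."""
    valid = np.isfinite(pseudo_depth) & np.isfinite(gt_depth) & (gt_depth > 0)
    r = pseudo_depth[valid] - gt_depth[valid]
    return {
        "mae": float(np.abs(r).mean()),
        "scale_bias": float(np.median(pseudo_depth[valid] / gt_depth[valid])),
        "bias": float(r.mean()),
        "std": float(r.std()),
    }
```

A scale_bias far from 1.0 or an additive bias far from 0 m would indicate exactly the systematic supervision error the referee is concerned about.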
Circularity Check
No circularity: external RPC geometry supplies independent supervision
Full rationale
The paper constructs pseudo-depth labels directly from the RPC camera model (an external geometric prior) and applies standard SiLog fine-tuning to Depth Anything V2. No equation or claim reduces by construction to the model's own outputs, fitted parameters, or prior self-citations; the adaptation is a conventional transfer-learning step whose success is measured against the independent DFC2019 benchmark. The derivation chain therefore remains non-circular.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The Rational Polynomial Camera model supplies accurate geometric constraints that can be converted into metric pseudo-depth supervision for fine-tuning (the projection it rests on is sketched below).
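For context, a minimal sketch of the standard RPC forward projection that this axiom relies on: image coordinates are ratios of 20-term cubic polynomials in normalized latitude, longitude, and height (the usual RPC00B layout). The dictionary field names are hypothetical placeholders, and the paper's pseudo-depth construction on top of this projection is not reproduced here.

```python
import numpy as np

def rpc_project(lat, lon, h, rpc):
    """Map geodetic coordinates to image (row, col) with an RPC model.

    rpc holds per-axis offsets/scales and four 20-element coefficient
    vectors (line_num, line_den, samp_num, samp_den); field names are
    hypothetical placeholders for a standard RPC00B record.
    """
    # Normalize inputs to roughly [-1, 1], as the RPC model expects.
    P = (lat - rpc["lat_off"]) / rpc["lat_scale"]
    L = (lon - rpc["lon_off"]) / rpc["lon_scale"]
    H = (h - rpc["h_off"]) / rpc["h_scale"]

    # The 20 cubic monomials of the RPC polynomial, in standard order.
    m = np.array([1, L, P, H, L * P, L * H, P * H, L * L, P * P, H * H,
                  P * L * H, L ** 3, L * P * P, L * H * H, L * L * P,
                  P ** 3, P * H * H, L * L * H, P * P * H, H ** 3])

    row = (rpc["line_num"] @ m) / (rpc["line_den"] @ m)
    col = (rpc["samp_num"] @ m) / (rpc["samp_den"] @ m)

    # De-normalize back to pixel coordinates.
    return (row * rpc["line_scale"] + rpc["line_off"],
            col * rpc["samp_scale"] + rpc["samp_off"])
```

Inverting this mapping along a viewing ray (e.g., by sampling candidate heights and localizing each pixel) is one way pseudo-depth labels can be derived from stereo correspondences.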
Reference graph
Works this paper leans on
- [1] Shariq Farooq Bhat, Reiner Birkl, Diana Wofk, Peter Wonka, and Matthias Müller. ZoeDepth: Zero-shot transfer by combining relative and metric depth, 2023.
- [2] Camille Billouard, Dawa Derksen, Emmanuelle Sarrazin, and Bruno Vallet. Sat-NGP: Unleashing neural graphics primitives for fast relightable transient-free 3D reconstruction from satellite imagery, 2024.
- [3] Binbin Huang, Zehao Yu, Anpei Chen, Andreas Geiger, and Shenghua Gao. 2D Gaussian splatting for geometrically accurate radiance fields. In SIGGRAPH 2024 Conference Papers. Association for Computing Machinery, 2024.
- [4] Lihan Jiang, Yucheng Mao, Linning Xu, Tao Lu, Kerui Ren, Yichen Jin, Xudong Xu, Mulin Yu, Jiangmiao Pang, Feng Zhao, et al. AnySplat: Feed-forward 3D Gaussian splatting from unconstrained views. arXiv preprint arXiv:2505.23716, 2025.
- [5] Bernhard Kerbl, Georgios Kopanas, Thomas Leimkühler, and George Drettakis. 3D Gaussian splatting for real-time radiance field rendering. ACM Transactions on Graphics, 42(4), 2023.
- [6] Vincent Leroy, Yohann Cabon, and Jerome Revaud. Grounding image matching in 3D with MASt3R, 2024.
- [7] Haotong Lin, Sili Chen, Jun Hao Liew, Donny Y. Chen, Zhenyu Li, Guang Shi, Jiashi Feng, and Bingyi Kang. Depth Anything 3: Recovering the visual space from any views. arXiv preprint arXiv:2511.10647, 2025.
- [8] Tianle Liu, Shuangming Zhao, Wanshou Jiang, and Bingxuan Guo. Sat-DN: Implicit surface reconstruction from multi-view satellite images with depth and normal supervision. arXiv preprint arXiv:2502.08352, 2025.
- [9] Xi Liu, Chaoyi Zhou, and Siyu Huang. 3DGS-Enhancer: Enhancing unbounded 3D Gaussian splatting with view-consistent 2D diffusion priors. In Advances in Neural Information Processing Systems (NeurIPS), 2024.
- [10] Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Sat-NeRF: Learning multi-view satellite photogrammetry with transient objects and shadow modeling using RPC cameras. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 1310–1320, 2022.
- [11] Roger Marí, Gabriele Facciolo, and Thibaud Ehret. Multi-date earth observation NeRF: The detail is in the shadows. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2034–2044, 2023.
- [12] Ben Mildenhall, Pratul P. Srinivasan, Matthew Tancik, Jonathan T. Barron, Ravi Ramamoorthi, and Ren Ng. NeRF: Representing scenes as neural radiance fields for view synthesis. In ECCV, 2020.
- [13] Ben Mildenhall, Dor Verbin, Pratul P. Srinivasan, Peter Hedman, Ricardo Martin-Brualla, and Jonathan T. Barron. MultiNeRF: A code release for Mip-NeRF 360, Ref-NeRF, and RawNeRF, 2022.
- [14] Thomas Müller, Alex Evans, Christoph Schied, and Alexander Keller. Instant neural graphics primitives with a multiresolution hash encoding. ACM Trans. Graph., 41(4):102:1–102:15, 2022.
- [15] Linfei Pan, Daniel Barath, Marc Pollefeys, and Johannes Lutz Schönberger. Global structure-from-motion revisited. In European Conference on Computer Vision (ECCV), 2024.
- [16] Yingjie Qu and Fei Deng. Sat-Mesh: Learning neural implicit surfaces for multi-view satellite reconstruction. Remote Sensing, 15:4297, 2023.
- [17] Bertrand Le Saux, Naoto Yokoya, Ronny Hänsch, and Myron Brown. Data Fusion Contest 2019 (DFC2019), 2019.
- [18] Johannes Lutz Schönberger and Jan-Michael Frahm. Structure-from-motion revisited. In Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
- [19] Johannes Lutz Schönberger, True Price, Torsten Sattler, Jan-Michael Frahm, and Marc Pollefeys. A vote-and-verify strategy for fast spatial verification in image retrieval. In Asian Conference on Computer Vision (ACCV), 2016.
- [20] Johannes Lutz Schönberger, Enliang Zheng, Marc Pollefeys, and Jan-Michael Frahm. Pixelwise view selection for unstructured multi-view stereo. In European Conference on Computer Vision (ECCV), 2016.
- [21] Jianyuan Wang, Minghao Chen, Nikita Karaev, Andrea Vedaldi, Christian Rupprecht, and David Novotny. VGGT: Visual geometry grounded transformer. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2025.
- [22] Peng Wang, Lingjie Liu, Yuan Liu, Christian Theobalt, Taku Komura, and Wenping Wang. NeuS: Learning neural implicit surfaces by volume rendering for multi-view reconstruction. arXiv preprint arXiv:2106.10689, 2021.
- [23] Qianqian Wang, Yifei Zhang, Aleksander Holynski, Alexei A. Efros, and Angjoo Kanazawa. Continuous 3D perception model with persistent state. arXiv preprint arXiv:2501.12387, 2025.
- [24] Run Wang, Chaoyi Zhou, Amir Salarpour, Xi Liu, Zhi-Qi Cheng, Feng Luo, Mert D. Pesé, and Siyu Huang. FlexMap: Generalized HD map construction from flexible camera configurations, 2026.
- [25] Shuzhe Wang, Vincent Leroy, Yohann Cabon, Boris Chidlovskii, and Jerome Revaud. DUSt3R: Geometric 3D vision made easy. In CVPR, 2024.
- [26] Jianing Yang, Alexander Sax, Kevin J. Liang, Mikael Henaff, Hao Tang, Ang Cao, Joyce Chai, Franziska Meier, and Matt Feiszli. Fast3R: Towards 3D reconstruction of 1000+ images in one forward pass. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025.
- [27] Lihe Yang, Bingyi Kang, Zilong Huang, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything: Unleashing the power of large-scale unlabeled data. In CVPR, 2024.
- [28] Lihe Yang, Bingyi Kang, Zilong Huang, Zhen Zhao, Xiaogang Xu, Jiashi Feng, and Hengshuang Zhao. Depth Anything V2. arXiv:2406.09414, 2024.
- [29] Zehao Yu, Anpei Chen, Binbin Huang, Torsten Sattler, and Andreas Geiger. Mip-Splatting: Alias-free 3D Gaussian splatting. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 19447–19456, 2024.
- [30] Junyi Zhang, Charles Herrmann, Junhwa Hur, Varun Jampani, Trevor Darrell, Forrester Cole, Deqing Sun, and Ming-Hsuan Yang. MonST3R: A simple approach for estimating geometry in the presence of motion. arXiv preprint arXiv:2410.03825, 2024.
- [31] Chaoyi Zhou, Xi Liu, Feng Luo, and Siyu Huang. Latent radiance fields with 3D-aware 2D representations. In International Conference on Learning Representations (ICLR).
- [32] Chaoyi Zhou, Run Wang, Feng Luo, Mert D. Pesé, Zhiwen Fan, Yiqi Zhong, and Siyu Huang. FF3R: Feedforward feature 3D reconstruction from unconstrained views. In CVPR Findings, 2026.

Table 2 (paper appendix). Ablation study on the maximum depth threshold.

Max Depth   Mean MAE↓   Mean MED↓
100         3.312       2.254
300         3.412       2.354
150         3.131       1.963