arxiv: 2511.10370 · v2 · submitted 2025-11-13 · 💻 cs.CV · cs.AI · cs.LG

SHRUG-FM: Reliability-Aware Foundation Models for Earth Observation

Pith reviewed 2026-05-17 22:17 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG
keywords SHRUG-FM · geospatial foundation models · out-of-distribution detection · predictive uncertainty · Earth observation · reliability-aware prediction · abstention policy · decision tree fusion

The pith

SHRUG-FM lets geospatial foundation models abstain from unreliable predictions by fusing three signals through an interpretable decision tree.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SHRUG-FM as a way for geospatial foundation models to spot and skip predictions likely to fail in real environments not well covered during pretraining. It combines three signals: out-of-distribution detection based on geophysical properties of the input image, out-of-distribution detection in the model's internal embedding space, and the model's own task-specific predictive uncertainty. These are fed into a shallow glass-box decision tree that outputs clear abstention thresholds. The method is tested on burn scar segmentation, flood mapping, and landslide detection, where it lowers risk on the predictions the model keeps and beats single-signal approaches such as predictive entropy. The result points toward more trustworthy use of these models in disaster and climate monitoring.

Core claim

SHRUG-FM integrates geophysical OOD detection in input space, OOD detection in embedding space, and task-specific predictive uncertainty through a shallow glass-box decision tree to produce interpretable abstention policies that reduce prediction risk on retained samples across burn scar segmentation, flood mapping, and landslide detection tasks.

What carries the argument

The shallow glass-box decision tree that fuses geophysical out-of-distribution signals, embedding-space OOD signals, and predictive uncertainty to generate abstention thresholds.
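
To make the fusion step concrete, here is a minimal sketch of a shallow tree that learns abstention thresholds from three per-sample signals. It is a hand-rolled, greedy Gini-based stand-in written for illustration, not the authors' implementation; the feature order, the toy values, and the function names (`fit_tree`, `should_abstain`) are all assumptions.

```python
from typing import List, Sequence

# Hypothetical feature order: (geophysical OOD, embedding OOD, predictive uncertainty).
Row = Sequence[float]
# A tree is a leaf label (1 = abstain, 0 = keep) or a tuple (feature, threshold, left, right).


def gini(labels: List[int]) -> float:
    """Gini impurity of a non-empty list of binary labels."""
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(X: List[Row], y: List[int]):
    """Return the (feature, threshold) with the lowest weighted Gini, or None."""
    best, best_score = None, None
    for f in range(len(X[0])):
        values = sorted(set(row[f] for row in X))
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [yi for row, yi in zip(X, y) if row[f] <= t]
            right = [yi for row, yi in zip(X, y) if row[f] > t]
            if not left or not right:
                continue
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if best_score is None or score < best_score:
                best, best_score = (f, t), score
    return best


def fit_tree(X: List[Row], y: List[int], depth: int = 2):
    """Greedy tree of limited depth; each leaf stores the majority label."""
    majority = int(2 * sum(y) >= len(y))
    if depth == 0 or len(set(y)) < 2:
        return majority
    split = best_split(X, y)
    if split is None:
        return majority
    f, t = split
    left = [i for i, row in enumerate(X) if row[f] <= t]
    right = [i for i, row in enumerate(X) if row[f] > t]
    return (f, t,
            fit_tree([X[i] for i in left], [y[i] for i in left], depth - 1),
            fit_tree([X[i] for i in right], [y[i] for i in right], depth - 1))


def should_abstain(tree, row: Row) -> bool:
    """Walk the tree until a leaf label is reached."""
    while not isinstance(tree, int):
        f, t, left, right = tree
        tree = left if row[f] <= t else right
    return tree == 1


# Toy data: a prediction "fails" (label 1) when embedding OOD or uncertainty is high.
X = [(0.1, 0.1, 0.1), (0.6, 0.2, 0.2), (0.2, 0.1, 0.3), (0.7, 0.3, 0.1),
     (0.1, 0.8, 0.2), (0.7, 0.9, 0.1), (0.2, 0.2, 0.8), (0.6, 0.1, 0.9)]
y = [0, 0, 0, 0, 1, 1, 1, 1]
tree = fit_tree(X, y, depth=2)
```

Because the fitted tree is a small nested tuple, every abstention rule can be read off as an explicit threshold on one named signal, which is the property the "glass-box" framing relies on.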

If this is right

  • Reduces prediction risk on the samples the model retains compared with single-signal baselines such as predictive entropy.
  • Supplies explicit, human-readable abstention thresholds for the three rapid-mapping tasks.
  • Supports safer deployment of geospatial foundation models in climate-sensitive applications without task-specific retraining.
  • Maintains performance advantages while remaining interpretable through the decision tree structure.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same three-signal fusion could be applied to other foundation models that face distribution shifts outside Earth observation.
  • The interpretable thresholds could let end users tune the model's risk tolerance for different operational settings.
  • If the signals stay complementary at larger scales, the approach might extend to additional high-stakes mapping tasks without added model complexity.

Load-bearing premise

The three signals remain complementary enough for a shallow decision tree to combine them into a reliable abstention policy without creating new failure modes or requiring task-specific retraining.

What would settle it

On the burn scar, flood, or landslide mapping tasks, the fused SHRUG-FM system produces equal or higher risk on retained samples than predictive entropy alone.

Figures

Figures reproduced from arXiv: 2511.10370 by Joppe Massant, Kai-Hendrik Cohrs, Maria Gonzalez-Calabuig, Patrick Ebel, Ruben Cartuyvels, Shruti Nath, Steffen Knoblauch, Vasileios Sitokonstantinou, Vishal Nedungadi, Zuzanna Osika.

Figure 1: The SHRUG-FM framework. It computes three complementary signals: OOD detection ...
Figure 2: SHRUG-FM combines complementary signals to flag unreliable predictions. (a) The ...
Figure 3: Density of distances to nearest k-means centroid for pretraining and downstream data.
Figure 4: Visualization of NCDD values for the downstream task overlaid on a hexagonal spatial ...
Figure 5: Visualization of F1 scores for the downstream task, Elevation (avg), River Area and Pasture ...
Original abstract

Geospatial foundation models (GFMs) for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that enables GFMs to identify and abstain from likely failures. Our approach integrates three complementary signals: geophysical out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space, and task-specific predictive uncertainty. We evaluate SHRUG-FM across three high-stakes rapid-mapping tasks: burn scar segmentation, flood mapping, and landslide detection. Our results show that SHRUG-FM consistently reduces prediction risk on retained samples, outperforming established single-signal baselines like predictive entropy. Crucially, by utilizing a shallow "glass-box" decision tree for signal fusion, SHRUG-FM provides interpretable abstention thresholds. It builds a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, bridging the gap between benchmark performance and real-world reliability.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims to introduce SHRUG-FM, a framework that enables geospatial foundation models to abstain from unreliable predictions by fusing three signals: geophysical out-of-distribution (OOD) detection in input space, OOD detection in embedding space, and task-specific predictive uncertainty. This fusion is performed using a shallow 'glass-box' decision tree. The framework is evaluated on three rapid-mapping tasks—burn scar segmentation, flood mapping, and landslide detection—showing consistent reduction in prediction risk on retained samples, outperforming single-signal baselines like predictive entropy, and providing interpretable abstention thresholds.

Significance. If validated, the results could have significant implications for the reliable deployment of foundation models in Earth observation, particularly in climate-sensitive and high-stakes applications such as disaster response. The emphasis on interpretability through the decision tree is a strength. The work builds on standard OOD and uncertainty signals, but integrating them through a decision tree offers a practical approach. However, the significance hinges on demonstrating that the multi-signal fusion adds value beyond the individual components.

major comments (2)
  1. The evaluation does not include correlation analysis between the geophysical OOD, embedding OOD, and predictive uncertainty signals, nor leave-one-signal-out ablations. This is critical because if the signals are correlated, the decision tree may not provide genuine fusion benefits, and the reported outperformance over predictive entropy could be an artifact of task-specific fitting rather than complementary information.
  2. Insufficient details are provided on the data splits used for training the decision tree, the exact metrics for risk reduction, statistical tests for significance, and whether abstention thresholds were tuned on held-out test sets. These are load-bearing for assessing if the risk reduction is robust and not post-hoc optimized.
minor comments (1)
  1. The abstract could benefit from including specific quantitative results, such as average risk reduction percentages or the number of samples retained, to better convey the magnitude of improvements.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the changes we will make in revision to improve the clarity and rigor of the evaluation.

Point-by-point responses
  1. Referee: The evaluation does not include correlation analysis between the geophysical OOD, embedding OOD, and predictive uncertainty signals, nor leave-one-signal-out ablations. This is critical because if the signals are correlated, the decision tree may not provide genuine fusion benefits, and the reported outperformance over predictive entropy could be an artifact of task-specific fitting rather than complementary information.

    Authors: We agree that explicit correlation analysis and leave-one-signal-out ablations would provide stronger evidence that the three signals are complementary rather than redundant. The current manuscript shows that SHRUG-FM outperforms the predictive-entropy baseline, but does not quantify inter-signal correlations or isolate the contribution of each signal. In the revised version we will add (i) pairwise Pearson and Spearman correlations computed on the validation sets for all three tasks and (ii) leave-one-signal-out ablation tables that report risk reduction when each signal is removed in turn. These additions will directly test whether the decision tree exploits complementary information. revision: yes
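
The pairwise correlation analysis promised in this response can be sketched in a few lines. This is a minimal stdlib-only illustration, assuming each signal is available as one scalar per validation sample; the signal names and toy values are hypothetical, and ties in the Spearman ranks are handled with average ranks.

```python
import math
from itertools import combinations
from typing import Dict, List, Sequence


def pearson(x: Sequence[float], y: Sequence[float]) -> float:
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def average_ranks(values: Sequence[float]) -> List[float]:
    """1-based ranks, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks


def spearman(x: Sequence[float], y: Sequence[float]) -> float:
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(average_ranks(x), average_ranks(y))


# Hypothetical per-sample signals on a validation split.
signals: Dict[str, List[float]] = {
    "geo_ood": [0.1, 0.4, 0.2, 0.9, 0.5],
    "emb_ood": [0.2, 0.3, 0.1, 0.8, 0.7],
    "uncertainty": [0.9, 0.1, 0.5, 0.3, 0.2],
}
for a, b in combinations(signals, 2):
    print(a, b, round(pearson(signals[a], signals[b]), 3),
          round(spearman(signals[a], signals[b]), 3))
```

Correlations near 1 between two signals would suggest the tree has little complementary information to exploit from that pair; low or negative values are consistent with, but do not prove, complementarity.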

  2. Referee: Insufficient details are provided on the data splits used for training the decision tree, the exact metrics for risk reduction, statistical tests for significance, and whether abstention thresholds were tuned on held-out test sets. These are load-bearing for assessing if the risk reduction is robust and not post-hoc optimized.

    Authors: We acknowledge that the experimental section would benefit from greater specificity. In the revision we will expand the description of the protocol to state: (a) the decision tree is trained exclusively on a validation split that is disjoint from all reported test sets; (b) the risk-reduction metric is defined as the difference in expected risk (1 - F1) between the full test set and the retained subset after abstention; (c) statistical significance is assessed with paired Wilcoxon signed-rank tests across the three tasks, with p-values reported; and (d) both the decision-tree hyperparameters and the final abstention thresholds are selected by cross-validation on the training-plus-validation data only, with no access to test-set labels or performance. These clarifications will remove any ambiguity about post-hoc optimization. revision: yes
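
The risk-reduction metric described in point (b) can be written down directly. A minimal sketch, assuming risk is the mean of 1 - F1 over per-sample F1 scores (the response does not say how F1 is aggregated) and that abstention decisions are already available as booleans; the function names and toy numbers are hypothetical.

```python
from typing import List, Sequence, Tuple


def expected_risk(f1_scores: Sequence[float]) -> float:
    """Expected risk, taken here as 1 minus the mean per-sample F1."""
    return 1.0 - sum(f1_scores) / len(f1_scores)


def risk_reduction(f1_scores: Sequence[float],
                   abstain: Sequence[bool]) -> Tuple[float, float]:
    """Return (risk on full set minus risk on retained subset, coverage)."""
    retained: List[float] = [f for f, a in zip(f1_scores, abstain) if not a]
    if not retained:
        raise ValueError("every sample was rejected; retained risk is undefined")
    reduction = expected_risk(f1_scores) - expected_risk(retained)
    coverage = len(retained) / len(f1_scores)
    return reduction, coverage


# Toy test split: the third sample is a poor prediction that the policy rejects.
f1 = [0.9, 0.8, 0.2, 0.7]
abstain = [False, False, True, False]
reduction, coverage = risk_reduction(f1, abstain)
print(f"risk reduction = {reduction:.2f} at coverage = {coverage:.2f}")
```

Reporting coverage next to the reduction matters because a policy can always lower retained risk by abstaining on almost everything.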

Circularity Check

0 steps flagged

No circularity: empirical fusion evaluated on held-out tasks

Full rationale

The paper introduces SHRUG-FM as a practical framework that combines three standard signals (geophysical OOD, embedding OOD, task uncertainty) via a shallow decision tree and reports empirical risk reduction on three rapid-mapping tasks. No equations, first-principles derivations, or fitted parameters are presented as predictions; results rest on direct experimental comparison against baselines on held-out data. The method is therefore self-contained against external benchmarks with no load-bearing self-citation chains or definitional reductions.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the unstated assumption that the three detection signals are complementary and that a shallow decision tree can learn stable abstention rules from them; no explicit free parameters, axioms, or invented entities are declared in the abstract.

pith-pipeline@v0.9.0 · 5511 in / 1153 out tokens · 30044 ms · 2026-05-17T22:17:25.850966+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

    Relation between the paper passage and the cited Recognition theorem.

    integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty... shallow 'glass-box' decision tree for signal fusion

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 1 internal anchor

  1. [1]

    Foundation models for remote sensing and earth observation: A survey, 2025, 2410.16602

    Aoran Xiao, Weihao Xuan, Junjue Wang, Jiaxing Huang, Dacheng Tao, Shijian Lu, and Naoto Yokoya. Foundation models for remote sensing and earth observation: A survey, 2025. URL https://arxiv.org/abs/2410.16602

  2. [2]

    https://madewithclay.org/, 2024

    Clay Foundation Model. https://madewithclay.org/, 2024. Open-source geospatial foundation model website

  3. [3]

    Yi Wang, Nassim Ait Ali Braham, Zhitong Xiong, Chenying Liu, Conrad M Albrecht, and Xiao Xiang Zhu. Ssl4eo-s12: A large-scale multimodal, multitemporal dataset for self-supervised learning in earth observation [software and data sets]. IEEE Geoscience and Remote Sensing Magazine, 11(3):98–106, 2023

  4. [4]

    Johannes Jakubik, Sujit Roy, C. E. Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, Daiki Kimura, et al. Foundation models for generalist geospatial artificial intelligence. arXiv preprint arXiv:2310.18660, 2023

  5. [5]

    Michael J Smith, Luke Fleming, and James E Geach. Earthpt: a time series foundation model for earth observation. arXiv preprint arXiv:2309.07207, 2023

  6. [6]

    Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning

    Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023

  7. [7]

    Exebench: Benchmarking foundation models on extreme earth events, 2025

    Shan Zhao, Zhitong Xiong, Jie Zhao, and Xiao Xiang Zhu. Exebench: Benchmarking foundation models on extreme earth events, 2025. URL https://arxiv.org/abs/2505.08529

  8. [8]

    Pangaea: A global and inclusive benchmark for geospatial foundation models. arXiv preprint arXiv:2412.04204, 2024

    Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, and Andrea Nascetti. Pangaea: A global and inclusive benchmark for geospatial foundation models, 2025. URL https://arxiv.org/abs/2412.04204

  9. [9]

    Yi Wang, Nassim Ait Ali Braham, Zhitong Xiong, Chenying Liu, Conrad M. Albrecht, and Xiao Xiang Zhu. SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in earth observation [software and data sets]. IEEE Geoscience and Remote Sensing Magazine, 11(3):98–106, 2023. doi: 10.1109/MGRS.2023.3281651

  10. [10]

    Reobench: Benchmarking robustness of earth observation foundation models, 2025

    Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, and Tianjin Huang. Reobench: Benchmarking robustness of earth observation foundation models, 2025. URL https://arxiv.org/abs/2505.16793

  11. [11]

    Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Andrew Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, Mehmet Gunturkun, Gabriel Huang, David Vazquez, Dava Newman, Yoshua Bengio, Stefano Ermon, and Xiao Xiang Zhu. Geo-bench: Toward foundation models for earth monitoring, 2023. URL https://arx...

  12. [12]

    Global hydro-environmental sub-basin and river reach characteristics at high spatial resolution. Scientific Data, 6(1):283, Dec 2019

    Simon Linke, Bernhard Lehner, Camille Ouellet Dallaire, Joseph Ariwi, Günther Grill, Mira Anand, Penny Beames, Vicente Burchard-Levine, Sally Maxwell, Hana Moidu, Florence Tan, and Michele Thieme. Global hydro-environmental sub-basin and river reach characteristics at high spatial resolution. Scientific Data, 6(1):283, Dec 2019. ISSN 2052-4463. doi: 10.103...

  13. [13]

    Wildfires and global change. Frontiers in Ecology and the Environment, 19(7):387–395, 2021

    Juli G Pausas and Jon E Keeley. Wildfires and global change. Frontiers in Ecology and the Environment, 19(7):387–395, 2021

  14. [14]

    Global wildland fire management research needs. Current Forestry Reports, 5(4):210–225, 2019

    Peter F Moore. Global wildland fire management research needs. Current Forestry Reports, 5(4):210–225, 2019

  15. [15]

    The landscape fire scars database: mapping historical burned area and fire severity in chile. Earth System Science Data, 14(8):3599–3613, 2022

    Alejandro Miranda, Rayén Mentler, Ítalo Moletto-Lobos, Gabriela Alfaro, Leonardo Aliaga, Dana Balbontín, Maximiliano Barraza, Susanne Baumbach, Patricio Calderón, Fernando Cárdenas, et al. The landscape fire scars database: mapping historical burned area and fire severity in chile. Earth System Science Data, 14(8):3599–3613, 2022

  16. [16]

    Ncdd: Nearest centroid distance deficit for out-of-distribution detection in gastrointestinal vision. arXiv preprint arXiv:2412.01590, 2024

    Sandesh Pokhrel, Sanjay Bhandari, Sharib Ali, Tryphon Lambrou, Anh Nguyen, Yash Raj Shrestha, Angus Watson, Danail Stoyanov, Prashnna Gyawali, and Binod Bhattarai. Ncdd: Nearest centroid distance deficit for out-of-distribution detection in gastrointestinal vision. arXiv preprint arXiv:2412.01590, 2024

  17. [17]

    What uncertainties do we need in bayesian deep learning for computer vision?

    Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper...

  18. [18]

    Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, 2015

    Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, 2015. URL https://arxiv.org/abs/1412.1897

  19. [19]

    Simple and scalable predictive uncertainty estimation using deep ensembles

    Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6405–6416, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964

  20. [20]

    Dropout as a bayesian approximation: Representing model uncertainty in deep learning

    Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, New York, New York, USA, 20–22 Jun 201...

  21. [21]

    Bias-Reduced Uncertainty Estimation for Deep Neural Classifiers

    Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206, 2018

  22. [22]

    Machine learning with a reject option: A survey, 2024

    Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, and Jesse Davis. Machine learning with a reject option: A survey, 2024. URL https://arxiv.org/abs/2107.11277

  23. [23]

    Global flood extent segmentation in optical satellite images. Scientific Reports, 13(1):20316, Nov 2023

    Enrique Portalés-Julià, Gonzalo Mateo-García, Cormac Purcell, and Luis Gómez-Chova. Global flood extent segmentation in optical satellite images. Scientific Reports, 13(1):20316, Nov 2023. ISSN 2045-2322. doi: 10.1038/s41598-023-47595-7. URL https://doi.org/10.1038/s41598-023-47595-7

  24. [24]

    Cas landslide dataset: A large-scale and multisensor dataset for deep learning-based landslide detection. Scientific Data, 11(1):12, 2024

    Yulin Xu, Chaojun Ouyang, Qingsong Xu, Dongpo Wang, Bo Zhao, and Yutao Luo. Cas landslide dataset: A large-scale and multisensor dataset for deep learning-based landslide detection. Scientific Data, 11(1):12, 2024

  25. [25]

    Terramind: Large-scale generative multimodality for earth observation. ICCV’25, 2025

    Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation. ICCV’25, 2025

  26. [26]

    Segmentation re-thinking uncertainty estimation metrics for semantic segmentation, 2024

    Qitian Ma, Shyam Nanda Rai, Carlo Masone, and Tatiana Tommasi. Segmentation re-thinking uncertainty estimation metrics for semantic segmentation, 2024. URL https://arxiv.org/abs/2403.19826

  27. [27]

    Tal Zeevi, Eléonore V. Lieffrig, Lawrence H. Staib, and John A. Onofrey. Spatially-aware evaluation of segmentation uncertainty, 2025. URL https://arxiv.org/abs/2506.16589

  28. [28]

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988, 2022. doi: 10.1109/CVPR52688.2022.01553

  29. [29]

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021. doi: 10.1109/ICCV48922.2021.00951

  30. [30]

    Momentum contrast for unsupervised visual representation learning

    Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735, 2020. doi: 10.1109/CVPR42600.2020.00975

  31. [31]

    Understanding contrastive versus reconstructive self-supervised learning of vision transformers

    Shashank Shekhar, Florian Bordes, Pascal Vincent, and Ari Morcos. Understanding contrastive versus reconstructive self-supervised learning of vision transformers. In Self-Supervised Learning: Theory and Practice, Workshop at NeurIPS 2022, December 2022

  32. [32]

    Analyzing local representations of self-supervised vision transformers, 2024

    Ani Vanyan, Alvard Barseghyan, Hakob Tamazyan, Vahan Huroyan, Hrant Khachatrian, and Martin Danelljan. Analyzing local representations of self-supervised vision transformers, 2024. URL https://arxiv.org/abs/2401.00463

  33. [33]

    Deep ensembles secretly perform empirical bayes, 2025

    Gabriel Loaiza-Ganem, Valentin Villecroze, and Yixin Wang. Deep ensembles secretly perform empirical bayes, 2025. URL https://arxiv.org/abs/2501.17917

  34. [34]

    Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. Workshop on Bayesian Deep Learning, NIPS, 2016

    Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. Workshop on Bayesian Deep Learning, NIPS, 2016. URL https://api.semanticscholar.org/CorpusID:8985844

    5 Appendix

    5.1 Data

    5.1.1 SSL4EO

    SSL4EO-S12 [9] is a large-scale dataset of multi-seasonal Sentinel-1 and Sentinel-2 (S2) imagery covering ~250,000 locatio...

  35. [35]

    Nearest Centroid Distance Deficit (NCDD). For computing the NCDD values, we use the formulation described in [16]: $\mathrm{NCDD} = \alpha \cdot D_{\text{others}} - \beta \cdot D_{\text{nearest}}$, where $D_{\text{nearest}}$ corresponds to the distance of a test point to the nearest centroid and $D_{\text{others}}$ corresponds to the sum of its distances to all the other centroids. Since results were robust to different settings of ...
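
The NCDD formula is simple enough to compute by hand. A minimal sketch, assuming Euclidean distance and unit weights for α and β (the excerpt is truncated before the chosen settings are given, so both defaults are placeholders); the centroids and test points are toy values.

```python
import math
from typing import Sequence


def ncdd(point: Sequence[float],
         centroids: Sequence[Sequence[float]],
         alpha: float = 1.0,
         beta: float = 1.0) -> float:
    """NCDD = alpha * D_others - beta * D_nearest for one test point."""
    distances = sorted(math.dist(point, c) for c in centroids)
    d_nearest = distances[0]        # distance to the closest centroid
    d_others = sum(distances[1:])   # summed distance to every other centroid
    return alpha * d_others - beta * d_nearest


# Toy 2-D centroids standing in for k-means centres fitted on pretraining embeddings.
centroids = [(0.0, 0.0), (4.0, 0.0), (0.0, 3.0)]
on_centroid = ncdd((0.0, 0.0), centroids)   # sits exactly on one centroid
equidistant = ncdd((2.0, 1.5), centroids)   # same distance to all three
print(on_centroid, equidistant)
```

With these toy values, the point sitting on a centroid scores higher than the point equidistant from all three, which is the contrast the deficit is built to expose.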

  36. [36]

    It is the ratio of the intersection of the two sets of pixels to their union

    IoU (Intersection over Union). Also known as the Jaccard Index, IoU measures the overlap between the predicted and ground truth areas. It is the ratio of the intersection of the two sets of pixels to their union. $\mathrm{IoU} = \frac{TP}{TP + FP + FN}$

  37. [37]

    F1-score. The F1-score is the harmonic mean of Precision ($\frac{TP}{TP + FP}$) and Recall ($\frac{TP}{TP + FN}$). $F_1 = 2 \cdot \frac{\text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$

  38. [38]

    Accuracy. Accuracy is the proportion of all correct predictions out of the total number of predictions. $\text{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$. For unbalanced tasks, where one class is significantly more frequent than the other (e.g., a small burn scar within a large image), Accuracy can be misleading. A model that simply predicts the majority class will achieve a high...
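
The three overlap metrics above, and the warning about accuracy on unbalanced masks, can be illustrated with a few lines of code. A minimal sketch on flattened binary masks; returning 0.0 when a denominator is zero is a convention chosen here, not something the excerpt specifies, and the toy masks are invented.

```python
from typing import Sequence, Tuple


def confusion_counts(pred: Sequence[int], truth: Sequence[int]) -> Tuple[int, int, int, int]:
    """Return (TP, FP, TN, FN) for flattened binary masks of equal length."""
    tp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 1)
    fp = sum(1 for p, t in zip(pred, truth) if p == 1 and t == 0)
    tn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 0)
    fn = sum(1 for p, t in zip(pred, truth) if p == 0 and t == 1)
    return tp, fp, tn, fn


def safe_div(num: float, den: float) -> float:
    """Division that returns 0.0 for an empty denominator (assumed convention)."""
    return num / den if den else 0.0


def iou(tp: int, fp: int, fn: int) -> float:
    return safe_div(tp, tp + fp + fn)


def f1_score(tp: int, fp: int, fn: int) -> float:
    return safe_div(2 * tp, 2 * tp + fp + fn)


def accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    return safe_div(tp + tn, tp + fp + tn + fn)


# A 100-pixel mask with a 2-pixel burn scar, and a model that predicts background everywhere.
truth = [1, 1] + [0] * 98
all_background = [0] * 100
tp, fp, tn, fn = confusion_counts(all_background, truth)
print(accuracy(tp, fp, tn, fn), f1_score(tp, fp, fn), iou(tp, fp, fn))
```

The all-background prediction reaches 0.98 accuracy while both F1 and IoU are 0, which is why the overlap metrics are the more informative choice for small targets.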

  39. [39]

    Expected Calibration Error (ECE). ECE measures how well a model’s predicted probabilities align with the observed accuracy. The probability range is divided into a fixed number of bins and ECE is the weighted average of the absolute differences between the average predicted probability (confidence) and the actual accuracy within each bin. $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}$...

  40. [40]

    Instead of using equally-sized bins, ACE uses adaptively-sized bins so that each bin contains an approximately equal number of data points

    Adaptive Calibration Error (ACE). ACE is an improved version of ECE that addresses its limitations in handling unbalanced datasets. Instead of using equally-sized bins, ACE uses adaptively-sized bins so that each bin contains an approximately equal number of data points. This ensures that even for rare, low-confidence predictions, there are enough samples ...
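
The difference between the two binning schemes is easiest to see side by side. A minimal sketch, assuming the usual size-weighted gap between accuracy and confidence per bin (the ECE formula in the excerpt is truncated, and the exact ACE weighting is not shown, so both are assumptions); the confidences and correctness flags are toy values.

```python
from typing import List, Sequence, Tuple

Pair = Tuple[float, int]  # (confidence, 1 if the prediction was correct else 0)


def weighted_gap(bins: Sequence[List[Pair]], n: int) -> float:
    """Sum over non-empty bins of |bin| / n * |accuracy - mean confidence|."""
    total = 0.0
    for b in bins:
        if not b:
            continue
        conf = sum(c for c, _ in b) / len(b)
        acc = sum(ok for _, ok in b) / len(b)
        total += len(b) / n * abs(acc - conf)
    return total


def ece(confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Calibration error with equal-width bins over [0, 1]."""
    bins: List[List[Pair]] = [[] for _ in range(n_bins)]
    for c, ok in zip(confidences, correct):
        idx = min(int(c * n_bins), n_bins - 1)  # confidence 1.0 falls in the last bin
        bins[idx].append((c, ok))
    return weighted_gap(bins, len(confidences))


def ace(confidences: Sequence[float], correct: Sequence[int], n_bins: int = 10) -> float:
    """Calibration error with equal-count bins formed after sorting by confidence."""
    pairs = sorted(zip(confidences, correct))
    n = len(pairs)
    bins = [pairs[m * n // n_bins:(m + 1) * n // n_bins] for m in range(n_bins)]
    return weighted_gap(bins, n)


confidences = [0.55, 0.62, 0.95, 0.97]
correct = [1, 0, 1, 1]
ece_value = ece(confidences, correct, n_bins=10)
ace_value = ace(confidences, correct, n_bins=2)
print(ece_value, ace_value)
```

In the equal-width version, two of the three occupied bins hold a single sample each, so their gaps are noisy; the equal-count version pools neighbouring confidences so every bin has the same support.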

  41. [41]

    It represents the ensemble’s consensus on the most likely class and the estimated aleatoric uncertainty

    Predicted Probability. The predicted probability $\bar{p}$ is the average of the individual model predictions. It represents the ensemble’s consensus on the most likely class and the estimated aleatoric uncertainty. $\bar{p} = \frac{1}{N} \sum_{i=1}^{N} p_i$

  42. [42]

    A high value indicates a diffuse, uncertain prediction over multiple classes

    Predictive Entropy. Predictive entropy $H(\bar{p})$ measures the overall uncertainty of the ensemble’s average prediction. A high value indicates a diffuse, uncertain prediction over multiple classes. $H(\bar{p}) = -\sum_{c=1}^{C} \bar{p}_c \log(\bar{p}_c)$, where $C$ is the number of classes and $\bar{p}_c$ is the mean predicted probability for class $c$

  43. [43]

    It captures the spread of individual model predictions

    Predictive Variance. Predictive variance $V(p)$ quantifies the disagreement among the ensemble members. It captures the spread of individual model predictions. $V(p) = \frac{1}{N} \sum_{i=1}^{N} (p_i - \bar{p})^2$
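
The three ensemble quantities defined so far (mean prediction, predictive entropy, predictive variance) can be computed for a single pixel as follows. A minimal sketch, assuming each ensemble member outputs a class-probability vector and that the variance is taken over the members' probability for one chosen class; the two-member ensemble is a toy example.

```python
import math
from typing import List, Sequence


def mean_prediction(member_probs: Sequence[Sequence[float]]) -> List[float]:
    """Average the class-probability vectors of the N ensemble members."""
    n = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n for c in range(n_classes)]


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(pc * math.log(pc) for pc in probs if pc > 0.0)


def predictive_variance(member_probs: Sequence[Sequence[float]], cls: int = 1) -> float:
    """Population variance of the members' probability for class `cls`."""
    values = [p[cls] for p in member_probs]
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values) / len(values)


# Two hypothetical ensemble members disagreeing about one pixel (background, burn scar).
members = [[0.9, 0.1], [0.5, 0.5]]
p_bar = mean_prediction(members)
h = entropy(p_bar)
v = predictive_variance(members, cls=1)
print(p_bar, round(h, 4), round(v, 4))
```

Entropy is computed on the averaged vector, so it cannot distinguish members that agree on an uncertain answer from members that disagree; the variance term picks up only the disagreement.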

  44. [44]

    It captures the epistemic uncertainty or the uncertainty due to a lack of model consensus

    Mutual Information. Mutual information $I$ measures the reduction in uncertainty about the predicted class provided by the ensemble. It captures the epistemic uncertainty or the uncertainty due to a lack of model consensus. $I = H(\bar{p}) - \frac{1}{N} \sum_{i=1}^{N} H(p_i)$, where $H(p_i)$ is the entropy of the $i$-th model’s prediction.

    5.2.5 Image-level Uncertainty Metrics

  45. [45]

    For a given pixel-level uncertainty metric M(x) (e.g., Predictive Variance), the image-level uncertainty $U_{\text{image}}$ is the average of that metric over the ROI

    Uncertainty over Region of Interest. To obtain a single uncertainty value for an entire image, we aggregate a chosen pixel-level metric over a specific region of interest (ROI). For a given pixel-level uncertainty metric $M(x)$ (e.g., Predictive Variance), the image-level uncertainty $U_{\text{image}}$ is the average of that metric over the ROI. $U_{\text{image}} = \frac{1}{|\mathrm{ROI}|} \sum_{x \in \mathrm{ROI}}$ ...
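
To illustrate the aggregation, the sketch below computes per-pixel mutual information from ensemble outputs and averages it over an ROI mask. It assumes mutual information as the chosen pixel-level metric and a boolean mask as the ROI; how the ROI is actually selected is not shown in the excerpt, so the mask here is hypothetical.

```python
import math
from typing import List, Sequence


def entropy(probs: Sequence[float]) -> float:
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(pc * math.log(pc) for pc in probs if pc > 0.0)


def mutual_information(member_probs: Sequence[Sequence[float]]) -> float:
    """I = H(mean prediction) - mean of the members' individual entropies."""
    n = len(member_probs)
    n_classes = len(member_probs[0])
    p_bar = [sum(p[c] for p in member_probs) / n for c in range(n_classes)]
    return entropy(p_bar) - sum(entropy(p) for p in member_probs) / n


def image_uncertainty(pixel_values: Sequence[float], roi_mask: Sequence[bool]) -> float:
    """Average a pixel-level metric over the pixels selected by the ROI mask."""
    selected: List[float] = [v for v, keep in zip(pixel_values, roi_mask) if keep]
    if not selected:
        raise ValueError("ROI is empty")
    return sum(selected) / len(selected)


# Two pixels, two ensemble members each.
pixel_a = [[0.5, 0.5], [0.5, 0.5]]   # members agree on an uncertain answer
pixel_b = [[1.0, 0.0], [0.0, 1.0]]   # members are confident but contradict each other
mi = [mutual_information(pixel_a), mutual_information(pixel_b)]
u_both = image_uncertainty(mi, [True, True])
u_only_a = image_uncertainty(mi, [True, False])
print(mi, u_both, u_only_a)
```

Pixel A has zero mutual information despite maximal entropy, while pixel B reaches log 2, so the image-level value depends strongly on which pixels the ROI includes.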

  46. [46]

    To create our ensemble, we train a set of 10 additional models with varying random seeds

    Deep Ensembles. Deep ensembles are a robust method for uncertainty estimation, as they have been shown to capture key properties of the Bayesian posterior distribution [33]. To create our ensemble, we train a set of 10 additional models with varying random seeds. To further enhance the diversity of the individual models and improve their uncertainty estima...

  47. [47]

    This method approximates a Bayesian neural network by introducing dropout layers during training and keeping them active during inference

    Monte Carlo Dropout. As a more computationally efficient alternative, we utilize Monte Carlo (MC) Dropout [20]. This method approximates a Bayesian neural network by introducing dropout layers during training and keeping them active during inference. Dropout layers are inserted following non-linear activation functions within multi-layer convolution blocks,...