SHRUG-FM: Reliability-Aware Foundation Models for Earth Observation
Pith reviewed 2026-05-17 22:17 UTC · model grok-4.3
The pith
SHRUG-FM lets geospatial foundation models abstain from unreliable predictions by fusing three signals through an interpretable decision tree.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SHRUG-FM integrates geophysical OOD detection in input space, OOD detection in embedding space, and task-specific predictive uncertainty through a shallow glass-box decision tree to produce interpretable abstention policies that reduce prediction risk on retained samples across burn scar segmentation, flood mapping, and landslide detection tasks.
What carries the argument
The shallow glass-box decision tree that fuses geophysical out-of-distribution signals, embedding-space OOD signals, and predictive uncertainty to generate abstention thresholds.
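To make the idea of a shallow, readable fusion tree concrete, here is a minimal sketch. It is not the SHRUG-FM implementation: the signal names, the toy values, the Gini-based fitting routine, and the depth of 2 are all assumptions chosen for illustration.

```python
# Illustrative only: signal names, toy values, and the Gini-based fitting
# routine are assumptions, not the SHRUG-FM implementation.
SIGNALS = ("geo_ood", "emb_ood", "uncertainty")


def gini(labels):
    """Gini impurity of a list of 0/1 labels (1 = prediction failed)."""
    if not labels:
        return 0.0
    p = sum(labels) / len(labels)
    return 2.0 * p * (1.0 - p)


def best_split(rows, labels):
    """Return the (feature, threshold) pair that lowers weighted Gini most."""
    best, best_score = None, gini(labels)
    for j in range(len(SIGNALS)):
        values = sorted({r[j] for r in rows})
        for lo, hi in zip(values, values[1:]):
            t = (lo + hi) / 2.0
            left = [y for r, y in zip(rows, labels) if r[j] <= t]
            right = [y for r, y in zip(rows, labels) if r[j] > t]
            score = (len(left) * gini(left) + len(right) * gini(right)) / len(rows)
            if score < best_score - 1e-12:
                best, best_score = (j, t), score
    return best


def fit_tree(rows, labels, depth=2):
    """Fit a tree with at most `depth` levels of splits."""
    split = best_split(rows, labels) if depth > 0 else None
    if split is None:
        # Leaf: abstain (1) when at least half of the samples here failed.
        return {"leaf": int(2 * sum(labels) >= len(labels))}
    j, t = split
    left = [(r, y) for r, y in zip(rows, labels) if r[j] <= t]
    right = [(r, y) for r, y in zip(rows, labels) if r[j] > t]
    return {
        "feature": j,
        "threshold": t,
        "left": fit_tree([r for r, _ in left], [y for _, y in left], depth - 1),
        "right": fit_tree([r for r, _ in right], [y for _, y in right], depth - 1),
    }


def should_abstain(tree, row):
    """Walk the tree for one sample and return True if the leaf says abstain."""
    while "leaf" not in tree:
        branch = "left" if row[tree["feature"]] <= tree["threshold"] else "right"
        tree = tree[branch]
    return tree["leaf"] == 1


def describe(tree, indent=""):
    """Render the tree as nested if/else lines, one threshold per line."""
    if "leaf" in tree:
        return [indent + ("abstain" if tree["leaf"] else "predict")]
    name, t = SIGNALS[tree["feature"]], tree["threshold"]
    return (
        [f"{indent}if {name} <= {t:.2f}:"]
        + describe(tree["left"], indent + "    ")
        + [f"{indent}else:"]
        + describe(tree["right"], indent + "    ")
    )


# Toy validation samples: (geo_ood, emb_ood, uncertainty), label 1 = failure.
rows = [(0.1, 0.2, 0.1), (0.2, 0.1, 0.2), (0.9, 0.2, 0.1),
        (0.8, 0.3, 0.2), (0.1, 0.2, 0.9), (0.2, 0.1, 0.8)]
labels = [0, 0, 1, 1, 1, 1]
tree = fit_tree(rows, labels, depth=2)
print("\n".join(describe(tree)))
```

On this toy data, no single signal separates the failures, while a depth-2 tree over two of the signals does. This is the kind of complementarity the framework depends on. The printed rules are the "abstention thresholds" a user could inspect.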
If this is right
- Reduces prediction risk on the samples the model retains compared with single-signal baselines such as predictive entropy.
- Supplies explicit, human-readable abstention thresholds for the three rapid-mapping tasks.
- Supports safer deployment of geospatial foundation models in climate-sensitive applications without task-specific retraining.
- Maintains performance advantages while remaining interpretable through the decision tree structure.
Where Pith is reading between the lines
- The same three-signal fusion could be applied to other foundation models that face distribution shifts outside Earth observation.
- The interpretable thresholds could let end users tune the model's risk tolerance for different operational settings.
- If the signals stay complementary at larger scales, the approach might extend to additional high-stakes mapping tasks without added model complexity.
Load-bearing premise
The three signals remain complementary enough for a shallow decision tree to combine them into a reliable abstention policy without creating new failure modes or requiring task-specific retraining.
What would settle it
On the burn scar, flood, or landslide mapping tasks, the fused SHRUG-FM system produces risk on retained samples that is equal to or higher than the risk obtained with predictive entropy alone.
Original abstract
Geospatial foundation models (GFMs) for Earth observation often fail to perform reliably in environments underrepresented during pretraining. We introduce SHRUG-FM, a framework for reliability-aware prediction that enables GFMs to identify and abstain from likely failures. Our approach integrates three complementary signals: geophysical out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space, and task-specific predictive uncertainty. We evaluate SHRUG-FM across three high-stakes rapid-mapping tasks: burn scar segmentation, flood mapping, and landslide detection. Our results show that SHRUG-FM consistently reduces prediction risk on retained samples, outperforming established single-signal baselines like predictive entropy. Crucially, by utilizing a shallow "glass-box" decision tree for signal fusion, SHRUG-FM provides interpretable abstention thresholds. It builds a pathway toward safer and more interpretable deployment of GFMs in climate-sensitive applications, bridging the gap between benchmark performance and real-world reliability.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims to introduce SHRUG-FM, a framework that enables geospatial foundation models to abstain from unreliable predictions by fusing three signals: geophysical out-of-distribution (OOD) detection in input space, OOD detection in embedding space, and task-specific predictive uncertainty. This fusion is performed using a shallow 'glass-box' decision tree. The framework is evaluated on three rapid-mapping tasks—burn scar segmentation, flood mapping, and landslide detection—showing consistent reduction in prediction risk on retained samples, outperforming single-signal baselines like predictive entropy, and providing interpretable abstention thresholds.
Significance. If validated, the results could have significant implications for the reliable deployment of foundation models in Earth observation, particularly in climate-sensitive and high-stakes applications such as disaster response. The emphasis on interpretability through the decision tree is a strength. The work builds on standard OOD and uncertainty signals, and integrating them through a decision tree is a practical approach. However, the significance hinges on demonstrating that the multi-signal fusion adds value beyond the individual components.
major comments (2)
- The evaluation does not include correlation analysis between the geophysical OOD, embedding OOD, and predictive uncertainty signals, nor leave-one-signal-out ablations. This is critical because if the signals are correlated, the decision tree may not provide genuine fusion benefits, and the reported outperformance over predictive entropy could be an artifact of task-specific fitting rather than complementary information.
- Insufficient details are provided on the data splits used for training the decision tree, the exact metrics for risk reduction, statistical tests for significance, and whether abstention thresholds were tuned on held-out test sets. These are load-bearing for assessing if the risk reduction is robust and not post-hoc optimized.
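The correlation check requested in the first major comment can be sketched as follows. This is a minimal, dependency-free illustration under stated assumptions: the three score lists are invented toy values, and the helper names `pearson`, `ranks`, and `spearman` are hypothetical, not taken from the paper.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation; undefined (division by zero) if either input is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = math.sqrt(sum((a - mx) ** 2 for a in x))
    sy = math.sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy)


def ranks(values):
    """1-based ranks; tied values share the mean of the positions they occupy."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = avg
        i = j + 1
    return result


def spearman(x, y):
    """Spearman correlation computed as Pearson correlation of the ranks."""
    return pearson(ranks(x), ranks(y))


# Toy per-sample scores (assumed values, one entry per validation image).
signals = {
    "geo_ood": [0.1, 0.4, 0.35, 0.8, 0.9],
    "emb_ood": [0.2, 0.1, 0.5, 0.7, 0.6],
    "uncertainty": [0.9, 0.2, 0.3, 0.1, 0.4],
}
for a, b in combinations(signals, 2):
    print(a, b, round(pearson(signals[a], signals[b]), 3),
          round(spearman(signals[a], signals[b]), 3))
```

Pairwise values near 1 or -1 would suggest that two signals carry largely the same information, which is the redundancy the referee is concerned about.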
minor comments (1)
- The abstract could benefit from including specific quantitative results, such as average risk reduction percentages or the number of samples retained, to better convey the magnitude of improvements.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. We address each major comment below and outline the changes we will make in revision to improve the clarity and rigor of the evaluation.
Point-by-point responses
-
Referee: The evaluation does not include correlation analysis between the geophysical OOD, embedding OOD, and predictive uncertainty signals, nor leave-one-signal-out ablations. This is critical because if the signals are correlated, the decision tree may not provide genuine fusion benefits, and the reported outperformance over predictive entropy could be an artifact of task-specific fitting rather than complementary information.
Authors: We agree that explicit correlation analysis and leave-one-signal-out ablations would provide stronger evidence that the three signals are complementary rather than redundant. The current manuscript shows that SHRUG-FM outperforms the predictive-entropy baseline, but does not quantify inter-signal correlations or isolate the contribution of each signal. In the revised version we will add (i) pairwise Pearson and Spearman correlations computed on the validation sets for all three tasks and (ii) leave-one-signal-out ablation tables that report risk reduction when each signal is removed in turn. These additions will directly test whether the decision tree exploits complementary information. revision: yes
-
Referee: Insufficient details are provided on the data splits used for training the decision tree, the exact metrics for risk reduction, statistical tests for significance, and whether abstention thresholds were tuned on held-out test sets. These are load-bearing for assessing if the risk reduction is robust and not post-hoc optimized.
Authors: We acknowledge that the experimental section would benefit from greater specificity. In the revision we will expand the description of the protocol to state: (a) the decision tree is trained exclusively on a validation split that is disjoint from all reported test sets; (b) the risk-reduction metric is defined as the difference in expected risk (1 - F1) between the full test set and the retained subset after abstention; (c) statistical significance is assessed with paired Wilcoxon signed-rank tests across the three tasks, with p-values reported; and (d) both the decision-tree hyperparameters and the final abstention thresholds are selected by cross-validation on the training-plus-validation data only, with no access to test-set labels or performance. These clarifications will remove any ambiguity about post-hoc optimization. revision: yes
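The risk-reduction metric in item (b) of the response above can be illustrated with a short sketch. Several details are assumptions because the response does not specify them: F1 is pooled over pixel counts from all images, each image is a dict with hypothetical keys `tp`, `fp`, and `fn`, and an empty set is given F1 = 1 by convention.

```python
def f1(tp, fp, fn):
    """F1 from pooled pixel counts; an empty set is treated as perfect (assumed)."""
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 1.0


def risk(samples):
    """Risk defined as 1 - F1, with counts pooled over all given images."""
    tp = sum(s["tp"] for s in samples)
    fp = sum(s["fp"] for s in samples)
    fn = sum(s["fn"] for s in samples)
    return 1.0 - f1(tp, fp, fn)


def risk_reduction(samples, abstain_flags):
    """Risk on the full set minus risk on the subset kept after abstention."""
    retained = [s for s, drop in zip(samples, abstain_flags) if not drop]
    return risk(samples) - risk(retained)


# Toy test set: the third image is a clear failure and gets abstained on.
samples = [
    {"tp": 90, "fp": 5, "fn": 5},
    {"tp": 80, "fp": 10, "fn": 10},
    {"tp": 10, "fp": 40, "fn": 50},
]
flags = [False, False, True]
print(round(risk(samples), 4), round(risk_reduction(samples, flags), 4))
```

A positive value means the retained subset is safer than the full set. Comparing two abstention policies is only fair when they retain the same fraction of samples, so coverage should be reported alongside this number.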
Circularity Check
No circularity: empirical fusion evaluated on held-out tasks
full rationale
The paper introduces SHRUG-FM as a practical framework that combines three standard signals (geophysical OOD, embedding OOD, task uncertainty) via a shallow decision tree and reports empirical risk reduction on three rapid-mapping tasks. No equations, first-principles derivations, or fitted parameters are presented as predictions; results rest on direct experimental comparison against baselines on held-out data. The method is therefore self-contained against external benchmarks with no load-bearing self-citation chains or definitional reductions.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
integrates three complementary signals: out-of-distribution (OOD) detection in the input space, OOD detection in the embedding space and task-specific predictive uncertainty... shallow 'glass-box' decision tree for signal fusion
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
Aoran Xiao, Weihao Xuan, Junjue Wang, Jiaxing Huang, Dacheng Tao, Shijian Lu, and Naoto Yokoya. Foundation models for remote sensing and earth observation: A survey, 2025. URL https://arxiv.org/abs/2410.16602
-
[2]
Clay Foundation Model. https://madewithclay.org/, 2024. Open-source geospatial foundation model website
-
[3]
Yi Wang, Nassim Ait Ali Braham, Zhitong Xiong, Chenying Liu, Conrad M Albrecht, and Xiao Xiang Zhu. SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in earth observation [software and data sets]. IEEE Geoscience and Remote Sensing Magazine, 11(3):98–106, 2023
-
[4]
Johannes Jakubik, Sujit Roy, C. E. Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, Daiki Kimura, et al. Foundation models for generalist geospatial artificial intelligence. arXiv preprint arXiv:2310.18660, 2023
-
[5]
Michael J Smith, Luke Fleming, and James E Geach. Earthpt: a time series foundation model for earth observation. arXiv preprint arXiv:2309.07207, 2023
-
[6]
Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-MAE: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023
-
[7]
Shan Zhao, Zhitong Xiong, Jie Zhao, and Xiao Xiang Zhu. Exebench: Benchmarking foundation models on extreme earth events, 2025. URL https://arxiv.org/abs/2505.08529
-
[8]
Valerio Marsocci, Yuru Jia, Georges Le Bellier, David Kerekes, Liang Zeng, Sebastian Hafner, Sebastian Gerard, Eric Brune, Ritu Yadav, Ali Shibli, Heng Fang, Yifang Ban, Maarten Vergauwen, Nicolas Audebert, and Andrea Nascetti. Pangaea: A global and inclusive benchmark for geospatial foundation models, 2025. URL https://arxiv.org/abs/2412.04204
-
[9]
Yi Wang, Nassim Ait Ali Braham, Zhitong Xiong, Chenying Liu, Conrad M. Albrecht, and Xiao Xiang Zhu. SSL4EO-S12: A large-scale multimodal, multitemporal dataset for self-supervised learning in earth observation [software and data sets]. IEEE Geoscience and Remote Sensing Magazine, 11(3):98–106, 2023. doi: 10.1109/MGRS.2023.3281651
-
[10]
Xiang Li, Yong Tao, Siyuan Zhang, Siwei Liu, Zhitong Xiong, Chunbo Luo, Lu Liu, Mykola Pechenizkiy, Xiao Xiang Zhu, and Tianjin Huang. Reobench: Benchmarking robustness of earth observation foundation models, 2025. URL https://arxiv.org/abs/2505.16793
-
[11]
Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Andrew Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, Mehmet Gunturkun, Gabriel Huang, David Vazquez, Dava Newman, Yoshua Bengio, Stefano Ermon, and Xiao Xiang Zhu. Geo-bench: Toward foundation models for earth monitoring, 2023. URL https://arx...
-
[12]
Simon Linke, Bernhard Lehner, Camille Ouellet Dallaire, Joseph Ariwi, Günther Grill, Mira Anand, Penny Beames, Vicente Burchard-Levine, Sally Maxwell, Hana Moidu, Florence Tan, and Michele Thieme. Global hydro-environmental sub-basin and river reach characteristics at high spatial resolution. Scientific Data, 6(1):283, Dec 2019. ISSN 2052-4463. doi: 10.103...
-
[13]
Juli G Pausas and Jon E Keeley. Wildfires and global change. Frontiers in Ecology and the Environment, 19(7):387–395, 2021
-
[14]
Peter F Moore. Global wildland fire management research needs. Current Forestry Reports, 5(4):210–225, 2019
-
[15]
Alejandro Miranda, Rayén Mentler, Ítalo Moletto-Lobos, Gabriela Alfaro, Leonardo Aliaga, Dana Balbontín, Maximiliano Barraza, Susanne Baumbach, Patricio Calderón, Fernando Cárdenas, et al. The landscape fire scars database: mapping historical burned area and fire severity in chile. Earth System Science Data, 14(8):3599–3613, 2022
-
[16]
Sandesh Pokhrel, Sanjay Bhandari, Sharib Ali, Tryphon Lambrou, Anh Nguyen, Yash Raj Shrestha, Angus Watson, Danail Stoyanov, Prashnna Gyawali, and Binod Bhattarai. Ncdd: Nearest centroid distance deficit for out-of-distribution detection in gastrointestinal vision. arXiv preprint arXiv:2412.01590, 2024
-
[17]
Alex Kendall and Yarin Gal. What uncertainties do we need in bayesian deep learning for computer vision? In I. Guyon, U. Von Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems, volume 30. Curran Associates, Inc., 2017. URL https://proceedings.neurips.cc/paper_files/paper...
-
[18]
Anh Nguyen, Jason Yosinski, and Jeff Clune. Deep neural networks are easily fooled: High confidence predictions for unrecognizable images, 2015. URL https://arxiv.org/abs/1412.1897
-
[19]
Balaji Lakshminarayanan, Alexander Pritzel, and Charles Blundell. Simple and scalable predictive uncertainty estimation using deep ensembles. In Proceedings of the 31st International Conference on Neural Information Processing Systems, NIPS’17, page 6405–6416, Red Hook, NY, USA, 2017. Curran Associates Inc. ISBN 9781510860964
-
[20]
Yarin Gal and Zoubin Ghahramani. Dropout as a bayesian approximation: Representing model uncertainty in deep learning. In Maria Florina Balcan and Kilian Q. Weinberger, editors, Proceedings of The 33rd International Conference on Machine Learning, volume 48 of Proceedings of Machine Learning Research, pages 1050–1059, New York, New York, USA, 20–22 Jun 201...
-
[21]
Yonatan Geifman, Guy Uziel, and Ran El-Yaniv. Bias-reduced uncertainty estimation for deep neural classifiers. arXiv preprint arXiv:1805.08206, 2018
-
[22]
Kilian Hendrickx, Lorenzo Perini, Dries Van der Plas, Wannes Meert, and Jesse Davis. Machine learning with a reject option: A survey, 2024. URL https://arxiv.org/abs/2107.11277
-
[23]
Enrique Portalés-Julià, Gonzalo Mateo-García, Cormac Purcell, and Luis Gómez-Chova. Global flood extent segmentation in optical satellite images. Scientific Reports, 13(1):20316, Nov 2023. ISSN 2045-2322. doi: 10.1038/s41598-023-47595-7. URL https://doi.org/10.1038/s41598-023-47595-7
-
[24]
Yulin Xu, Chaojun Ouyang, Qingsong Xu, Dongpo Wang, Bo Zhao, and Yutao Luo. Cas landslide dataset: A large-scale and multisensor dataset for deep learning-based landslide detection. Scientific Data, 11(1):12, 2024
-
[25]
Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation. ICCV’25, 2025
-
[26]
Qitian Ma, Shyam Nanda Rai, Carlo Masone, and Tatiana Tommasi. Segmentation re-thinking uncertainty estimation metrics for semantic segmentation, 2024. URL https://arxiv.org/abs/2403.19826
-
[27]
Tal Zeevi, Eléonore V. Lieffrig, Lawrence H. Staib, and John A. Onofrey. Spatially-aware evaluation of segmentation uncertainty, 2025. URL https://arxiv.org/abs/2506.16589.
-
[28]
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 15979–15988, 2022. doi: 10.1109/CVPR52688.2022.01553
-
[29]
Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jegou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In 2021 IEEE/CVF International Conference on Computer Vision (ICCV), pages 9630–9640, 2021. doi: 10.1109/ICCV48922.2021.00951
-
[30]
Kaiming He, Haoqi Fan, Yuxin Wu, Saining Xie, and Ross Girshick. Momentum contrast for unsupervised visual representation learning. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9726–9735, 2020. doi: 10.1109/CVPR42600.2020.00975
-
[31]
Shashank Shekhar, Florian Bordes, Pascal Vincent, and Ari Morcos. Understanding contrastive versus reconstructive self-supervised learning of vision transformers. In Self-Supervised Learning: Theory and Practice, Workshop at NeurIPS 2022, December 2022
-
[32]
Ani Vanyan, Alvard Barseghyan, Hakob Tamazyan, Vahan Huroyan, Hrant Khachatrian, and Martin Danelljan. Analyzing local representations of self-supervised vision transformers, 2024. URL https://arxiv.org/abs/2401.00463
-
[33]
Gabriel Loaiza-Ganem, Valentin Villecroze, and Yixin Wang. Deep ensembles secretly perform empirical bayes, 2025. URL https://arxiv.org/abs/2501.17917
-
[34]
Ian Osband. Risk versus uncertainty in deep learning: Bayes, bootstrap and the dangers of dropout. Workshop on Bayesian Deep Learning, NIPS, 2016. URL https://api.semanticscholar.org/CorpusID:8985844.
5 Appendix
5.1 Data
5.1.1 SSL4EO
SSL4EO-S12 [9] is a large-scale dataset of multi-seasonal Sentinel-1 and Sentinel-2 (S2) imagery covering ~250,000 locatio...
-
[35]
Nearest Centroid Distance Deficit (NCDD). For computing the NCDD values, we use the formulation described in [16]:
$$\mathrm{NCDD} = \alpha \cdot D_{\text{others}} - \beta \cdot D_{\text{nearest}}$$
where $D_{\text{nearest}}$ corresponds to the distance of a test point to the nearest centroid and $D_{\text{others}}$ corresponds to the sum of its distances to all the other centroids. Since results were robust to different settings of ...
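The NCDD formula above can be sketched in a few lines. The Euclidean distance, the default weights of 1.0, and the function name `ncdd` are assumptions for illustration; the source text is truncated before it states the weights the authors used.

```python
import math


def ncdd(point, centroids, alpha=1.0, beta=1.0):
    """NCDD = alpha * D_others - beta * D_nearest for one embedding.

    Assumes Euclidean distance and at least two centroids; alpha and beta
    defaults are placeholders, not values from the paper.
    """
    dists = sorted(math.dist(point, c) for c in centroids)
    d_nearest = dists[0]          # distance to the closest centroid
    d_others = sum(dists[1:])     # summed distance to every other centroid
    return alpha * d_others - beta * d_nearest


centroids = [(0.0, 0.0), (10.0, 0.0)]
print(ncdd((1.0, 0.0), centroids))  # close to one centroid, far from the other
print(ncdd((5.0, 0.0), centroids))  # equidistant from both centroids
```

With these toy centroids, a point sitting near one centroid gets a larger value than a point halfway between the two, which is the contrast the score is built to expose.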
-
[36]
IoU (Intersection over Union). Also known as the Jaccard Index, IoU measures the overlap between the predicted and ground truth areas. It is the ratio of the intersection of the two sets of pixels to their union.
$$\mathrm{IoU} = \frac{TP}{TP + FP + FN}$$
-
[37]
F1-score. The F1-score is the harmonic mean of Precision ($\frac{TP}{TP + FP}$) and Recall ($\frac{TP}{TP + FN}$).
$$F1 = 2 \cdot \frac{\mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}$$
-
[38]
Accuracy. Accuracy is the proportion of all correct predictions out of the total number of predictions.
$$\mathrm{Accuracy} = \frac{TP + TN}{TP + FP + TN + FN}$$
For unbalanced tasks, where one class is significantly more frequent than the other (e.g., a small burn scar within a large image), Accuracy can be misleading. A model that simply predicts the majority class will achieve a high...
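A small worked example may help show why accuracy misleads on unbalanced masks while IoU and F1 do not. The pixel counts below are invented for illustration, and the functions assume the denominators are non-zero.

```python
def iou(tp, fp, fn):
    """Intersection over Union from pixel counts (assumes tp + fp + fn > 0)."""
    return tp / (tp + fp + fn)


def f1_score(tp, fp, fn):
    """F1 from pixel counts (assumes 2 * tp + fp + fn > 0)."""
    return 2 * tp / (2 * tp + fp + fn)


def accuracy(tp, fp, tn, fn):
    """Share of all pixels that were classified correctly."""
    return (tp + tn) / (tp + fp + tn + fn)


# Toy image: 10,000 pixels, 100 of them burned.
# Model A predicts "not burned" everywhere.
a = dict(tp=0, fp=0, tn=9900, fn=100)
# Model B finds most of the scar with a few errors.
b = dict(tp=80, fp=20, tn=9880, fn=20)

for name, c in (("A", a), ("B", b)):
    print(name,
          round(accuracy(**c), 4),
          round(iou(c["tp"], c["fp"], c["fn"]), 4),
          round(f1_score(c["tp"], c["fp"], c["fn"]), 4))
```

Model A reaches 99% accuracy while detecting nothing, so its IoU and F1 are both zero. The two overlap metrics are also tied to each other by F1 = 2·IoU / (1 + IoU), which follows directly from the two formulas above.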
-
[39]
Expected Calibration Error (ECE). ECE measures how well a model’s predicted probabilities align with the observed accuracy. The probability range is divided into a fixed number of bins and ECE is the weighted average of the absolute differences between the average predicted probability (confidence) and the actual accuracy within each bin. $\mathrm{ECE} = \sum_{m=1}^{M} \frac{|B_m|}{n}$ ...
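The verbal definition above (a size-weighted average of the gap between accuracy and mean confidence in each bin) can be sketched as follows. This follows the commonly used equal-width formulation; the bin count of 10, the half-open bin edges, and the function name `ece` are assumptions, since the formula in the source is truncated.

```python
def ece(confidences, correct, n_bins=10):
    """Expected Calibration Error with equal-width confidence bins.

    confidences: predicted probability of the chosen class, in [0, 1].
    correct: 1 if the prediction was right, else 0.
    """
    n = len(confidences)
    bins = [[] for _ in range(n_bins)]
    for conf, ok in zip(confidences, correct):
        idx = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        bins[idx].append((conf, ok))
    total = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(o for _, o in members) / len(members)
        total += len(members) / n * abs(acc - avg_conf)
    return total


# Overconfident toy model: says 0.95 but is right only half of the time.
print(round(ece([0.95, 0.95, 0.95, 0.95], [1, 1, 0, 0]), 4))
```

A perfectly calibrated model would have accuracy equal to confidence in every bin and therefore an ECE of zero.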
-
[40]
Adaptive Calibration Error (ACE). ACE is an improved version of ECE that addresses its limitations in handling unbalanced datasets. Instead of using equal-width bins, ACE uses adaptively sized bins so that each bin contains an approximately equal number of data points. This ensures that even for rare, low-confidence predictions, there are enough samples ...
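The equal-count binning described above can be sketched as follows. The text is truncated before it states how per-bin gaps are aggregated, so the size-weighted average used here is an assumption, as are the function names and the toy inputs.

```python
def equal_mass_bins(pairs, n_bins):
    """Sort (confidence, correct) pairs and cut them into near-equal chunks."""
    pairs = sorted(pairs)
    n = len(pairs)
    edges = [round(k * n / n_bins) for k in range(n_bins + 1)]
    return [pairs[lo:hi] for lo, hi in zip(edges, edges[1:]) if hi > lo]


def adaptive_calibration_error(confidences, correct, n_bins=10):
    """Calibration gap averaged over equal-count bins (weighting is assumed)."""
    pairs = list(zip(confidences, correct))
    n = len(pairs)
    total = 0.0
    for members in equal_mass_bins(pairs, n_bins):
        avg_conf = sum(c for c, _ in members) / len(members)
        acc = sum(o for _, o in members) / len(members)
        total += len(members) / n * abs(acc - avg_conf)
    return total


print(round(adaptive_calibration_error([0.6, 0.6, 0.9, 0.9], [1, 0, 1, 1], n_bins=2), 4))
```

Because the bin edges follow the data, sparse confidence ranges are merged with their neighbours instead of producing bins that hold only a handful of samples.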
-
[41]
Predicted Probability. The predicted probability $\bar{p}$ is the average of the individual model predictions. It represents the ensemble’s consensus on the most likely class and the estimated aleatoric uncertainty.
$$\bar{p} = \frac{1}{N} \sum_{i=1}^{N} p_i$$
-
[42]
Predictive Entropy. Predictive entropy $H(\bar{p})$ measures the overall uncertainty of the ensemble’s average prediction. A high value indicates a diffuse, uncertain prediction over multiple classes.
$$H(\bar{p}) = -\sum_{c=1}^{C} \bar{p}_c \log(\bar{p}_c)$$
where $C$ is the number of classes and $\bar{p}_c$ is the mean predicted probability for class $c$.
-
[43]
Predictive Variance. Predictive variance $V(p)$ quantifies the disagreement among the ensemble members. It captures the spread of individual model predictions.
$$V(p) = \frac{1}{N} \sum_{i=1}^{N} (p_i - \bar{p})^2$$
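The ensemble statistics in this subsection can be computed for a single pixel as in the sketch below, which also includes the mutual information defined in the next item. The function names and the two toy ensembles are assumptions for illustration; variance is taken per class, which is one reading of the formula above.

```python
import math


def mean_prob(member_probs):
    """Average the class-probability vectors of the N ensemble members."""
    n = len(member_probs)
    n_classes = len(member_probs[0])
    return [sum(p[c] for p in member_probs) / n for c in range(n_classes)]


def entropy(p):
    """Shannon entropy in nats; zero-probability classes contribute nothing."""
    return -sum(q * math.log(q) for q in p if q > 0)


def predictive_variance(member_probs):
    """Per-class variance of member probabilities around the ensemble mean."""
    mean = mean_prob(member_probs)
    n = len(member_probs)
    return [sum((p[c] - mean[c]) ** 2 for p in member_probs) / n
            for c in range(len(mean))]


def mutual_information(member_probs):
    """Entropy of the mean prediction minus the mean entropy of the members."""
    avg_member_entropy = sum(entropy(p) for p in member_probs) / len(member_probs)
    return entropy(mean_prob(member_probs)) - avg_member_entropy


# Two toy ensembles with the same mean prediction [0.5, 0.5].
agree = [[0.5, 0.5], [0.5, 0.5]]      # every member is individually unsure
disagree = [[1.0, 0.0], [0.0, 1.0]]   # members are confident but contradict
for name, ens in (("agree", agree), ("disagree", disagree)):
    print(name, round(entropy(mean_prob(ens)), 4),
          predictive_variance(ens), round(mutual_information(ens), 4))
```

Both toy ensembles have the same predictive entropy, yet only the second shows non-zero variance and mutual information. This is why entropy alone cannot separate member disagreement from shared ambiguity.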
-
[44]
Mutual Information. Mutual information $I$ measures the reduction in uncertainty about the predicted class provided by the ensemble. It captures the epistemic uncertainty or the uncertainty due to a lack of model consensus.
$$I = H(\bar{p}) - \frac{1}{N} \sum_{i=1}^{N} H(p_i)$$
where $H(p_i)$ is the entropy of the $i$-th model’s prediction.
5.2.5 Image-level Uncertainty Metrics
-
[45]
Uncertainty over Region of Interest. To obtain a single uncertainty value for an entire image, we aggregate a chosen pixel-level metric over a specific region of interest (ROI). For a given pixel-level uncertainty metric $M(x)$ (e.g., Predictive Variance), the image-level uncertainty $U_{\text{image}}$ is the average of that metric over the ROI. $U_{\text{image}} = \frac{1}{|\mathrm{ROI}|} \sum_{x \in \mathrm{ROI}}$ ...
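The averaging step described above can be sketched with plain nested lists. How the ROI is chosen is not visible in the truncated text, so the binary mask here is a hypothetical input, as are the function name and the toy values.

```python
def image_uncertainty(metric_map, roi_mask):
    """Mean of a pixel-level metric over the pixels where roi_mask is truthy.

    metric_map and roi_mask are 2D lists of equal shape; the mask is an
    assumed input, since the ROI definition is not given in the text.
    """
    values = [
        value
        for metric_row, mask_row in zip(metric_map, roi_mask)
        for value, inside in zip(metric_row, mask_row)
        if inside
    ]
    if not values:
        raise ValueError("ROI is empty; image-level uncertainty is undefined")
    return sum(values) / len(values)


variance_map = [[0.1, 0.9],
                [0.5, 0.3]]
roi = [[1, 0],
       [1, 0]]
print(round(image_uncertainty(variance_map, roi), 4))
```

Restricting the average to an ROI keeps large, confidently classified background areas from diluting the uncertainty of the pixels that matter for the task.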
-
[46]
Deep Ensembles. Deep ensembles are a robust method for uncertainty estimation, as they have been shown to capture key properties of the Bayesian posterior distribution [33]. To create our ensemble, we train a set of 10 additional models with varying random seeds. To further enhance the diversity of the individual models and improve their uncertainty estima...
-
[47]
Monte Carlo Dropout. As a more computationally efficient alternative, we utilize Monte Carlo (MC) Dropout [20]. This method approximates a Bayesian neural network by introducing dropout layers during training and keeping them active during inference. Dropout layers are inserted following non-linear activation functions within multi-layer convolution blocks,...