EarthShift: a benchmark for measuring robustness to real-world distribution shifts in Earth observation

Hannah Kerner; Kelsey Doerksen

arxiv: 2605.29330 · v1 · pith:6NL4XQKJnew · submitted 2026-05-28 · 💻 cs.CV

Pith reviewed 2026-06-29 08:39 UTC · model grok-4.3

classification 💻 cs.CV

keywords Earth observation · distribution shifts · model robustness · geospatial foundation models · remote sensing · benchmark · out-of-distribution performance

The pith

Geospatial foundation models perform 15-20% worse on average when evaluated out-of-distribution on Earth observation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents EarthShift as the first public benchmark for testing how well Earth observation models handle realistic distribution shifts such as changes in time, geography, scale, and sensors. Through experiments on eight geospatial foundation models across eleven tasks and five shift types, it finds a consistent average performance decline of 15-20% when moving to out-of-distribution data. This decline occurs independently of the model's architecture, size, pre-training method, or fine-tuning approach. The results indicate that current models struggle with the variability encountered in real deployments. EarthShift allows direct measurement of this robustness gap using paired datasets.

Core claim

EarthShift enables measuring distributional robustness by comparing in- and out-of-distribution performance on paired datasets differing in sources, temporal windows, geographic locations, and sensors. Experiments demonstrate that eight geospatial foundation models perform 15-20% worse out-of-distribution on average across 11 tasks, with robustness levels comparable to generic vision foundation models and fully-supervised models.
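As a rough illustration of the comparison the benchmark is built around, the sketch below computes the gap between in-distribution (ID) and out-of-distribution (OOD) scores for each task and then averages it. The task names, the scores, and the `robustness_gap` helper are hypothetical and are not taken from the paper or the EarthShift code. This summary does not say whether the reported 15-20% is an absolute or a relative difference, so both are shown.

```python
from statistics import mean


def robustness_gap(id_score: float, ood_score: float) -> dict:
    """Return the absolute and relative ID-to-OOD change for one task."""
    absolute = ood_score - id_score          # negative means worse OOD
    relative = absolute / id_score * 100.0   # change as a percent of the ID score
    return {"absolute": absolute, "relative_pct": relative}


# Hypothetical scores for illustration only; not results from the paper.
scores = {
    "task_a": {"id": 0.80, "ood": 0.64},
    "task_b": {"id": 0.50, "ood": 0.45},
}

gaps = {name: robustness_gap(s["id"], s["ood"]) for name, s in scores.items()}

# Each task counts equally in both averages.
mean_absolute = mean(g["absolute"] for g in gaps.values())
mean_relative = mean(g["relative_pct"] for g in gaps.values())

print(f"mean absolute change: {mean_absolute:.3f}")
print(f"mean relative change: {mean_relative:.1f}%")
```

A model with no robustness gap would score close to zero on both quantities for every pair.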

What carries the argument

The EarthShift benchmark, which uses paired datasets from different sources, temporal windows, geographic locations, and sensors to compare in- and out-of-distribution performance.

If this is right

Research should aim to improve distributional robustness in addition to in-distribution performance for foundation models.
Robustness levels are similar between geospatial foundation models, generic vision models, and fully-supervised models.
EarthShift provides a testbed for developing more reliable models for real-world remote sensing applications.
The need for robustness improvements is highlighted as a key direction for future work.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Improving robustness could enhance the reliability of models in dynamic applications such as climate change tracking.
The benchmark could be used to test specific techniques for mitigating particular shift types like sensor or temporal changes.
Similar benchmarks might be valuable in other domains facing distribution shifts in imagery data.

Load-bearing premise

The paired datasets from different sources, temporal windows, geographic locations, and sensors accurately capture the distribution shifts models will encounter in real-world deployment scenarios.

What would settle it

A new model achieving similar performance on both in-distribution and out-of-distribution pairs within EarthShift would challenge the claim of a general robustness deficit.

Figures

Figures reproduced from arXiv: 2605.29330 by Hannah Kerner, Kelsey Doerksen.

**Figure 2.** Effective distribution shift analysis for (left) frozen backbone and (right) full fine-tuning.

**Figure 3.** Distribution shift analysis for (a) frozen backbone and (b) full fine-tuning as a function of ...

**Figure 4.** Model-type comparison of absolute difference in OOD-ID test set performance per shift ...

**Figure 5.** Comparison between full fine-tuning and frozen backbone for absolute performance ...

**Figure 6.** Model performance as a function of size. We explore the mean performance delta (OOD-ID score) per model as a function of its backbone parameter size. We show that there is no correlation between model size and mean distributional robustness across our tasks.
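The size-versus-robustness check described in the Figure 6 caption could be run along the following lines. This is a minimal sketch: the caption does not say which statistic the authors used, so the Spearman rank correlation here is an assumption, and the parameter counts and deltas are made-up values for illustration.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def ranks(values):
    """Rank values from 1 to n (this simple version assumes no ties)."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    for rank, index in enumerate(order, start=1):
        result[index] = float(rank)
    return result


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical backbone sizes (millions of parameters) and mean OOD-ID deltas.
params_millions = [25, 86, 300, 600]
mean_delta = [-0.17, -0.15, -0.18, -0.16]

rho = spearman(params_millions, mean_delta)
print(f"Spearman rho between size and mean delta: {rho:.2f}")
```

A value of rho near zero would be consistent with the caption's statement; with only a handful of models, any such estimate carries wide uncertainty.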

Original abstract

Current Earth observation benchmarks focus on measuring performance on diverse tasks and applications, typically measuring generalization in-distribution. But when models are deployed, they must generalize to myriad out-of-distribution scenarios, such as new time periods, geographies, scales, and sensors. We introduce EarthShift: the first public testbed for benchmarking robustness across multiple realistic distribution shifts encountered in remote sensing. EarthShift enables users to measure distributional robustness by comparing performance in- and out-of-distribution using paired datasets from different sources, temporal windows, geographic locations, and sensors. Our experiments on 8 geospatial foundation models (GFMs) and 11 tasks covering 5 shift types show that GFMs consistently perform 15-20% worse out-of-distribution on average regardless of model architecture, size, pre-training or fine-tuning strategy. We show that GFM robustness is similar to that of generic vision foundation models, and even fully-supervised models. This highlights a need for future research to strive for improvements in distributional robustness, not just performance, which can be benchmarked using EarthShift. We release our code and datasets to provide a testbed to guide future work to create foundation models that are robust and reliable in real-world applications. Code and data for EarthShift are available at: https://earthshift.github.io

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

EarthShift gives the field a usable public testbed for real distribution shifts in remote sensing, but the headline 15-20% OOD drop claim rests on paired datasets whose other differences are not yet shown to be controlled.

The paper's main contribution is a released benchmark that pairs real remote-sensing datasets across temporal, geographic, sensor, and source shifts, then measures how eight geospatial foundation models perform on eleven tasks. The consistent 15-20% OOD drop, independent of architecture or training strategy, is the central empirical result, and the authors make the datasets and code public at earthshift.github.io.

This is useful because most existing Earth-observation benchmarks stay in-distribution; having matched real pairs for multiple shift types lets people test robustness directly. The work also shows that the robustness gap looks similar for generic vision models and even fully supervised baselines, which narrows the claim to something about the data rather than model class.

The soft spot is the dataset pairing itself. The abstract describes pairs drawn from different sources, times, locations, and sensors, yet gives no numbers on whether resolution, label noise, or class balance are matched within each ID/OOD pair. If those factors differ systematically, the measured gap cannot be attributed cleanly to the intended distribution shifts. The stress-test note flags exactly this issue, and nothing in the provided abstract rules it out.

The paper is aimed at researchers who build or deploy models for environmental monitoring and need a concrete way to measure real-world robustness. Anyone already working on geospatial foundation models or OOD evaluation will find the testbed worth looking at.

It should go to peer review. The benchmark construction is a concrete, shareable artifact even if the performance numbers need tighter validation on the pairing controls.

Referee Report

1 major / 1 minor

Summary. The paper introduces EarthShift, the first public benchmark for measuring robustness of geospatial foundation models (GFMs) to realistic distribution shifts in Earth observation. It constructs paired in-distribution (ID) and out-of-distribution (OOD) datasets differing in sources, temporal windows, geographic locations, and sensors across 5 shift types. Experiments on 8 GFMs and 11 tasks report a consistent 15-20% average performance drop OOD, independent of architecture, size, pre-training, or fine-tuning, and comparable to generic vision models and fully supervised baselines. The work releases code and datasets to support future robustness research.

Significance. If the reported OOD drops can be attributed to the intended shifts, the benchmark provides a valuable, reproducible testbed that shifts focus from in-distribution accuracy to distributional robustness in remote sensing. The public release of paired datasets, code, and the testbed itself is a concrete strength that enables community follow-up and falsifiable comparisons.

major comments (1)

[Dataset construction / Experiments] § on dataset construction and pairing (described in abstract and methods): The central claim of a 15-20% OOD drop 'regardless of model architecture, size, pre-training or fine-tuning strategy' requires that the measured gaps arise from the intended shifts rather than incidental mismatches. No quantitative controls, statistics, or ablations are reported verifying that spatial resolution, class balance, or annotation quality are matched within each ID/OOD pair. Without such evidence, the performance gap cannot be confidently attributed to distributional robustness alone.

minor comments (1)

[Abstract] Abstract: the 15-20% figure should explicitly state the performance metric (e.g., accuracy, mIoU, F1) and the precise aggregation method across the 11 tasks.
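To make the aggregation concern concrete, the sketch below shows that two common ways of summarizing a drop across tasks can yield different headline numbers from the same scores. The scores and function names are hypothetical and do not come from the paper; the point is only that the aggregation order needs to be stated.

```python
from statistics import mean

# Hypothetical (ID, OOD) scores per task; not results from the paper.
scores = {
    "task_a": (0.80, 0.64),
    "task_b": (0.50, 0.45),
}


def mean_of_relative_drops(scores):
    """Compute each task's relative change first, then average over tasks."""
    return mean((ood - id_) / id_ * 100.0 for id_, ood in scores.values())


def relative_drop_of_means(scores):
    """Average ID and OOD scores over tasks first, then take the relative change."""
    mean_id = mean(id_ for id_, _ in scores.values())
    mean_ood = mean(ood for _, ood in scores.values())
    return (mean_ood - mean_id) / mean_id * 100.0


per_task_first = mean_of_relative_drops(scores)
pooled_first = relative_drop_of_means(scores)

print(f"mean of per-task relative drops: {per_task_first:.2f}%")
print(f"relative drop of mean scores:    {pooled_first:.2f}%")
```

The two summaries diverge further when tasks use metrics with very different ID baselines, which is one reason a mixed accuracy, mIoU, and F1 average is hard to interpret without the details the referee asks for.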

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback. We address the major comment below.

Referee: [Dataset construction / Experiments] § on dataset construction and pairing (described in abstract and methods): The central claim of a 15-20% OOD drop 'regardless of model architecture, size, pre-training or fine-tuning strategy' requires that the measured gaps arise from the intended shifts rather than incidental mismatches. No quantitative controls, statistics, or ablations are reported verifying that spatial resolution, class balance, or annotation quality are matched within each ID/OOD pair. Without such evidence, the performance gap cannot be confidently attributed to distributional robustness alone.

Authors: We agree that explicit verification is required to attribute the observed gaps to the intended distribution shifts. In the revised manuscript we will add a new subsection with quantitative controls for each ID/OOD pair, reporting (i) spatial-resolution statistics (mean, std, and range in meters), (ii) class-balance ratios and Earth-mover distance between label distributions, and (iii) annotation-quality proxies (e.g., inter-source label agreement or expert review scores where available). These statistics will be computed directly from the released paired datasets and will be accompanied by a short ablation confirming that performance gaps remain after subsampling to enforce exact resolution and class-balance matching. We believe this addition will allow confident attribution to distributional robustness.

Revision: yes
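The class-balance control proposed in the rebuttal might look something like the sketch below. It is not the authors' procedure. Earth-mover distance needs a ground distance between classes; for unordered labels with unit distance between any two distinct classes it reduces to total variation distance, which is what `total_variation` computes here. The label lists and all function names are hypothetical.

```python
import random
from collections import Counter


def label_distribution(labels):
    """Map each class to its share of the labels."""
    counts = Counter(labels)
    total = len(labels)
    return {c: n / total for c, n in counts.items()}


def total_variation(p, q):
    """Total variation distance between two categorical distributions."""
    classes = set(p) | set(q)
    return 0.5 * sum(abs(p.get(c, 0.0) - q.get(c, 0.0)) for c in classes)


def subsample_to_match(labels, target_dist, seed=0):
    """Return indices of the largest subsample whose class shares follow target_dist."""
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    # Largest total size n such that n * share fits within every class pool.
    n = min(len(by_class.get(c, [])) / p for c, p in target_dist.items() if p > 0)
    chosen = []
    for c, p in target_dist.items():
        pool = by_class.get(c, [])
        k = min(len(pool), round(n * p))
        chosen.extend(rng.sample(pool, k))
    return sorted(chosen)


# Hypothetical label lists for one ID/OOD pair.
id_labels = ["crop"] * 60 + ["forest"] * 40
ood_labels = ["crop"] * 30 + ["forest"] * 70

p_id = label_distribution(id_labels)
tv_before = total_variation(p_id, label_distribution(ood_labels))

matched = subsample_to_match(ood_labels, p_id)
matched_labels = [ood_labels[i] for i in matched]
tv_after = total_variation(p_id, label_distribution(matched_labels))

print(f"TV before: {tv_before:.2f}, after: {tv_after:.2f}, kept {len(matched)} samples")
```

Re-evaluating a model on the matched subset and comparing the gap with the unmatched one would indicate how much of the measured drop is attributable to class imbalance alone.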

Circularity Check

0 steps flagged

No circularity: results are direct empirical measurements on released paired datasets

The paper constructs paired ID/OOD datasets from different sources, times, locations, and sensors, then reports measured performance gaps (15-20% OOD drop) across 8 GFMs and 11 tasks. These are straightforward empirical computations with no equations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations. The central claim reduces only to running models on the benchmark data and averaging accuracies, which is self-contained against external benchmarks and does not reduce to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical benchmark introduction paper. No free parameters, mathematical axioms, or invented entities are introduced; the work relies on standard machine learning evaluation practices and publicly sourced remote sensing datasets.

[1] [1]

Domain adaptation for the classification of remote sensing data: An overview of recent advances.IEEE Geoscience and Remote Sensing Magazine, 4(2):41–57, 2016

Devis Tuia, Claudio Persello, and Lorenzo Bruzzone. Domain adaptation for the classification of remote sensing data: An overview of recent advances.IEEE Geoscience and Remote Sensing Magazine, 4(2):41–57, 2016

2016

[2] [2]

Position: Mission critical–satellite data is a distinct modality in machine learning

Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Position: Mission critical–satellite data is a distinct modality in machine learning. InForty-first International Conference on Machine Learning

[3] [3]

Geo-bench-2: From performance to capability, rethinking evaluation in geospatial ai, 2026

Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel, Salman Khan, Manil Maskey, Nicolas Longepe, Xiao Xiang Zhu, Hannah Kerner, Juan Bernabe- Moreno, and Alexandre Lacoste. Geo-bench-2: From performance to capability, rethinking evaluation in geospatial ai, 2026

2026

[4] [4]

Benchmarking neural network robustness to common corruptions and perturbations

Dan Hendrycks and Thomas Dietterich. Benchmarking neural network robustness to common corruptions and perturbations. InProceedings of the International Conference on Learning Representations, 2019

2019

[5] [5]

Temme, Jonas Rauber, Heiko H

Robert Geirhos, Carlos R. Temme, Jonas Rauber, Heiko H. Schütt, Matthias Bethge, and Felix A. Wichmann. Generalisation in humans and deep neural networks.Advances in Neural Information Processing Systems, 31, 2018

2018

[6] [6]

Towards deep learning models resistant to adversarial attacks

Aleksander Madry, Aleksandar Makelov, Ludwig Schmidt, Dimitris Tsipras, and Adrian Vladu. Towards deep learning models resistant to adversarial attacks. InProceedings of the International Conference on Learning Representations, 2017

2017

[7] [7]

Wichmann, and Wieland Brendel

Robert Geirhos, Patricia Rubisch, Claudio Michaelis, Matthias Bethge, Felix A. Wichmann, and Wieland Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. InProceedings of the International Conference on Learning Representations, 2018

2018

[8] [8]

Arbitrary style transfer in real-time with adaptive instance normalization

Xun Huang and Serge Belongie. Arbitrary style transfer in real-time with adaptive instance normalization. InProceedings of the IEEE International Conference on Computer Vision (ICCV), October 2017

2017

[9] [9]

Domain-adversarial training of neural networks.Journal of Machine Learning Research, 17(59):1–35, 2016

Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks.Journal of Machine Learning Research, 17(59):1–35, 2016

2016

[10] [10]

Deep coral: Correlation alignment for deep domain adaptation

Baochen Sun and Kate Saenko. Deep coral: Correlation alignment for deep domain adaptation. InEuropean Conference on Computer Vision (ECCV), pages 443–450. Springer, 2016

2016

[11] [11]

Invariant Risk Minimization

Martin Arjovsky, Léon Bottou, Ishaan Gulrajani, and David Lopez-Paz. Invariant risk mini- mization. InarXiv preprint arXiv:1907.02893, 2019

work page internal anchor Pith review Pith/arXiv arXiv 1907

[12] [12]

Tent: Fully test-time adaptation by entropy minimization

Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. InInternational Conference on Learning Representations (ICLR), 2021

2021

[13] [13]

WILDS: A benchmark of in-the-wild distribution shifts

Pang Wei Koh, Shiori Sagawa, Henrik Marklund, Sang Michael Xie, Marvin Zhang, Akshay Balsubramani, Weixin Hu, Michihiro Yasunaga, Richard Lanas Phillips, Irena Gao, et al. WILDS: A benchmark of in-the-wild distribution shifts. InInternational Conference on Machine Learning, pages 5637–5664. PMLR, 2021

2021

[14] Vaishaal Shankar, Rebecca Roelofs, Horia Mania, Alex Fang, Benjamin Recht, and Ludwig Schmidt. Evaluating machine accuracy on ImageNet. In International Conference on Machine Learning, pages 8634–8644. PMLR, 2020.

[15] Andrei Barbu, David Mayo, Julian Alverio, William Luo, Chenyun Wang, Dan Gutfreund, Joshua Tenenbaum, and Boris Katz. ObjectNet: A large-scale bias-controlled dataset for pushing the limits of object recognition models. Advances in Neural Information Processing Systems, 32, 2019.

[16] Sahil Sachdeva, Ivan Lopez, Chandrashekhar Biradar, and David Lobell. A distribution shift benchmark for small-holder agroforestry: Do foundation models improve geographic generalization? In NeurIPS 2024 Workshop on Tackling Climate Change with Machine Learning, 2024.

[17] Rohan Taori, Achal Dave, Vaishaal Shankar, Nicholas Carlini, Benjamin Recht, and Ludwig Schmidt. Measuring robustness to natural distribution shifts in image classification. In Advances in Neural Information Processing Systems, volume 33, pages 18583–18599, 2020.

[18] Veit Benson and Alexander Ecker. Assessing out-of-domain generalization for robust building damage detection. arXiv preprint, 2020.

[19] Hannah Kerner, Siddharth Sundar, and Mohit Satish. Multi-region transfer learning for segmentation of crop field boundaries in satellite images with limited labels. In AAAI Conference on Artificial Intelligence (AAAI) Workshops, 2023.

[20] Hakob Tamazyan, Ani Vanyan, Alvard Barseghyan, Anna Khosrovyan, Evan Shelhamer, and Hrant Khachatrian. Geocrossbench: Cross-band generalization for remote sensing. arXiv preprint arXiv:2511.02831, 2025.

[21] Zhitong Xiong, Fahong Zhang, Yi Wang, Yilei Shi, and Xiao Xiang Zhu. Earthnets: Empowering AI in earth observation. arXiv preprint arXiv:2210.04936, 2022.

[22] Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Andrew Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, Mehmet Gunturkun, Gabriel Huang, David Vazquez, Dava Newman, Yoshua Bengio, Stefano Ermon, and Xiao Xiang Zhu. Geo-bench: Toward foundation models for earth monitoring, 2023.

[23] Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.

[24] Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 2010.

[25] Hannah Kerner, Snehal Chaudhari, Aninda Ghosh, Caleb Robinson, Adeel Ahmad, Eddie Choi, Nathan Jacobs, Chris Holmes, Matthias Mohr, Rahul Dodhia, Juan M. Lavista Ferres, and Jennifer Marcus. Fields of the world: A machine learning benchmark dataset for global agricultural field boundary segmentation, 2024.

[26] Kai Norman Clasen, Leonard Hackel, Tom Burgert, Gencer Sumbul, Begüm Demir, and Volker Markl. reBEN: Refined BigEarthNet dataset for remote sensing image analysis. arXiv preprint arXiv:2407.03653, 2024.

[27] Derrick Bonafilia, Beth Tellman, Tyler Anderson, and Erica Issenberg. Sen1floods11: a georeferenced dataset to train and test deep learning flood algorithms for sentinel-1. In 2020 IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 835–845, 2020.

[28] Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet?, 2019.

[29] Ilke Demir, Krzysztof Koperski, David Lindenbaum, Guan Pang, Jing Huang, Saikat Basu, Forest Hughes, Devis Tuia, and Ramesh Raskar. Deepglobe 2018: A challenge to parse the earth through satellite images. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.

[30] Ronny Hänsch, Claudio Persello, Gemine Vivone, Javiera Castillo Navarro, Alexandre Boulch, Sebastien Lefevre, and Bertrand Le Saux. Data fusion contest 2022 (dfc2022), January 2022.

[31] Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Joëlle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired foundation model for observing the Earth crossing modalities. arXiv preprint arXiv:2403.15356, 2024.

[32] Anthony Fuller, Koreen Millard, and James R. Green. Croma: Remote sensing representations with contrastive radar-optical masked autoencoders, 2023.

[33] Clay Foundation. Clay foundation model: An open source ai model for earth, 2024. Apache-2.0 License.

[34] Daniela Szwarcman, Sujit Roy, Paolo Fraccaro, Þorsteinn Elí Gíslason, Benedikt Blumenstiel, Rinki Ghosal, Pedro Henrique de Oliveira, Joao Lucas de Sousa Almeida, Rocco Sedona, Yanghui Kang, Srija Chakraborty, Sizhe Wang, Carlos Gomes, Ankur Kumar, Myscon Truong, Denys Godwin, Hyunho Lee, Chia-Yu Hsu, Ata Akbari Asanjan, Besart Mujeci, Disha Shidham, Tr... Prithvi-eo-2.0: A versatile multi-temporal foundation model for earth observation applications, 2025.

[35] Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, and Salman Khan. Terrafm: A scalable foundation model for unified multisensor earth observation. arXiv preprint arXiv:2506.06281, 2025.

[36] Oriane Siméoni, Huy V. Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, Francisco Massa, Daniel Haziza, Luca Wehrstedt, Jianyuan Wang, Timothée Darcet, Théo Moutakanni, Leonel Sentana, Claire Roberts, Andrea Vedaldi, Jamie Tolan, John Brandt, Camille Couprie, Julie... 2025.

[37] Gabriel Tseng, Anthony Fuller, Marlena Reil, Henry Herzog, Patrick Beukema, Favyen Bastani, James R. Green, Evan Shelhamer, Hannah Kerner, and David Rolnick. Galileo: Learning global and local features of many remote sensing modalities, 2025.

[38] Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, Rahul Ramachandran, Paolo Fraccaro, Thomas Brunschwiler, Gabriele Cavallaro, Juan Bernabe-Moreno, and Nicolas Longépé. Terramind: Large-scale generative multimodality for earth observation, 2025.

[39] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition, 2015.

[40] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale, 2021.

[41] Alec Radford, Jong Wook Kim, Chris Hallacy, A. Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In ICML, 2021.

[42] Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. In International Conference on Learning Representations, 2019.

[43] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 248–255. IEEE, 2009.

[44] S. T. Brown, P. Buitrago, E. Hanna, S. Sanielevici, R. Scibek, and N. A. Nystrom. Bridges-2: A platform for rapidly-evolving and data intensive research. In Practice and Experience in Advanced Research Computing, pages 1–4, 2021.

[45] Douglas M. Jennewein, Johnathan Lee, Chris Kurtz, Will Dizon, Ian Shaeffer, Alan Chapman, Alejandro Chiquete, Josh Burks, Amber Carlson, Natalie Mason, Arhat Kobwala, Thirugnanam Jagadeesan, Praful Barghav, Torey Battelle, Rebecca Belshe, Debra McCaffrey, Marisa Brazil, Chaitanya Inumella, Kirby Kuznia, Jade Buzinski, Sean Dudley, Dhruvil Shah, Gil Speyer... 2023.