pith. machine review for the scientific record.

arxiv: 2605.12678 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.CY

Recognition: no theorem link

No One Knows the State of the Art in Geospatial Foundation Models

Authors on Pith no claims yet

Pith reviewed 2026-05-14 21:01 UTC · model grok-4.3

classification 💻 cs.CV cs.CY
keywords geospatial foundation models · evaluation standards · reproducibility · Earth observation · model benchmarking · pretraining configurations · community standards · model comparison

The pith

Nobody knows the state of the art in geospatial foundation models because papers cannot be compared.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Geospatial foundation models are proposed as general backbones for high-stakes tasks like disaster response and land-cover mapping, yet published work leaves users unable to identify the best model for any given task. An audit of 152 papers uncovers 46 cross-paper disagreements of at least 10 points on the same model, benchmark, and protocol, 94 unique pretraining configurations out of 126 extractable cases, and 39 percent of papers releasing no weights at all. These inconsistencies arise because the literature does not standardize evaluations, training protocols, weight releases, or controls. The authors frame this as a solvable coordination failure and outline six concrete expectations that would allow direct comparisons.

Core claim

The paper establishes that the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank models. A 152-paper audit finds 46 disagreements of 10+ points on identical model-benchmark-protocol triples, 94 papers using unique pretraining data configurations, and 39 percent of papers releasing no weights. The authors conclude that this prevents determination of the current state of the art and propose six expectations to remedy it.
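
The counting step behind these headline numbers is mechanical once each paper's reported results are extracted into a common schema. Below is a minimal Python sketch of the matching-and-thresholding logic; the flat record format (paper, model, benchmark, metric, protocol, score) is a hypothetical illustration, not the paper's actual audit code.

```python
from collections import defaultdict
from itertools import combinations

# Hypothetical extracted records; the Scale-MAE pair mirrors the worst gap in Figure 2.
records = [
    ("paper_A", "Scale-MAE", "NWPU-RESISC45", "accuracy", "linear-probe", 89.6),
    ("paper_B", "Scale-MAE", "NWPU-RESISC45", "accuracy", "linear-probe", 33.0),
]

# Group reported scores by the setting they claim to measure.
by_setting = defaultdict(list)
for paper, model, bench, metric, protocol, score in records:
    by_setting[(model, bench, metric, protocol)].append((paper, score))

def cross_paper_disagreements(threshold=10.0):
    """List matched settings where two papers differ by at least `threshold` points."""
    gaps = []
    for setting, reports in by_setting.items():
        for (p1, s1), (p2, s2) in combinations(reports, 2):
            if abs(s1 - s2) >= threshold:
                gaps.append((setting, p1, s1, p2, s2))
    return gaps

print(cross_paper_disagreements())  # the example pair above is a 56.6-point gap
```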

What carries the argument

The 152-paper audit that quantifies performance disagreements, unique pretraining setups, and weight-release rates across the GFM literature.

If this is right

  • Models cannot be ranked or selected for specific tasks on the basis of published numbers.
  • Users cannot confidently choose the strongest GFM for applications such as disaster response or food-security monitoring.
  • Differences in architecture or pretraining cannot be isolated from differences in evaluation protocol.
  • Community progress on GFMs is slowed by the inability to build on or refute prior claims.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar coordination failures may exist in other domain-specific foundation model literatures that also lack shared harnesses.
  • Adopting the six proposed expectations would make it possible to run controlled experiments that separate data effects from architecture effects, as sketched after this list.
  • Widespread weight release under named licenses would enable independent groups to test models on new benchmarks without retraining from scratch.
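
One way to make the controls expectation concrete: hold one factor fixed while varying the other, so each comparison isolates a single cause. A minimal sketch in Python, where train_and_eval is a hypothetical stand-in for a full pretrain-plus-evaluate run, not an existing API:

```python
from itertools import product

def controlled_grid(architectures, pretrain_sets, train_and_eval):
    """Full factorial grid: varying one axis with the other held fixed
    attributes any score change to that axis alone."""
    return {(arch, data): train_and_eval(arch, data)
            for arch, data in product(architectures, pretrain_sets)}

# Hypothetical usage: the data effect under a fixed backbone is read off as
#   grid[("ViT-B", "SSL4EO-S12")] vs grid[("ViT-B", "MillionAID")]
# and the architecture effect as the comparison down a fixed-data column.
```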

Load-bearing premise

The sampled papers are representative of the full GFM literature and the observed disagreements stem primarily from missing standards rather than other factors.

What would settle it

A single shared evaluation harness applied to all existing models that produces consistent rankings with no 10-point disagreements on the same benchmarks.
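
Concretely, such a harness reduces to one fixed loop over (model, benchmark) pairs under a single frozen protocol, with variance reported over seeds so that ∼1-point training stochasticity cannot be mistaken for a real gap. A minimal sketch; the load_model, load_benchmark, and probe callables are hypothetical stand-ins for whatever interfaces the community agrees on:

```python
import statistics

def evaluate_all(model_names, benchmark_names, load_model, load_benchmark, probe,
                 seeds=(0, 1, 2)):
    """One code path for every model and benchmark: report mean and std over seeds."""
    results = {}
    for m in model_names:
        backbone = load_model(m)        # released, frozen weights
        for b in benchmark_names:
            data = load_benchmark(b)    # fixed splits and metric
            scores = [probe(backbone, data, seed=s) for s in seeds]
            results[(m, b)] = (statistics.mean(scores), statistics.stdev(scores))
    return results
```

With every number produced by the same code path, a 10-point gap on the same setting would signal a genuine modeling difference rather than a protocol artifact.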

Figures

Figures reproduced from arXiv: 2605.12678 by Anthony Fuller, Caleb Robinson, Evan Shelhamer, Gabriel Tseng, Hamed Alemohammad, Hannah Kerner, Isaac Corley, Jennifer Marcus, Nils Lehmann.

Figure 1
Figure 1. How the 152-paper corpus uses benchmarks. Panel (a) shows the top-10 benchmarks evaluated in the corpus; panel (b) shows that 35% of papers do not test on the most-used benchmarks at all; panel (c) shows that this pattern is not improving over time. Together the three panels say no GFM in the corpus can be ranked literature-wide, because the numbers needed for a fair comparison are not reported on enough s… view at source ↗
Figure 2
Figure 2. Papers report wildly different numbers for the “same” experiment. Across 301 cases with matching (model, benchmark, metric, protocol), many disagreed by ≥5, ≥10, or ≥20 points (left); the 10 largest gaps are shown (right). The worst: Scale-MAE on NWPU-RESISC45 linear probing, 33.0 vs. 89.6 from the same checkpoint and nominal setup. Training stochasticity is ∼1 point, so these differences are far larger th… view at source ↗
Figure 3
Figure 3. Top-10 (of 87) named pretraining datasets across the 126 corpus papers that name one. MillionAID leads at just 9 papers (∼5.9% of 152); SSL4EO-S12 (8), fMoW (6), and fMoW-RGB (5) follow. When a paper changes both the model and the pretraining data, readers cannot tell which change caused the gain unless one is held fixed. This is an attribution problem, not an argument for identical pretraining data. A fo… view at source ↗
read the original abstract

Geospatial foundation models (GFMs) have been proposed as generalizable backbones for disaster response, land-cover mapping, food-security monitoring, and other high-stakes Earth-observation tasks. Yet the published work about these models does not give reviewers or users enough information to tell which model fits a given task. We argue that nobody knows what the current state of the art is in geospatial foundation models. The methods may be useful, but the GFM literature does not standardize evaluations, training and testing protocols, released weights, or pretraining controls well enough for anyone to compare or rank them. In a 152-paper audit, we find 46 cross-paper disagreements of at least 10 points for the same model, benchmark, and protocol; 94/126 papers with extractable pretraining data use a configuration no other paper uses; and 39% of GFM papers release no model weights. This lack of community standards can be solved. We propose six concrete expectations: named-license weight release, shared core evaluations, copied-versus-rerun baseline annotations, variance reporting, one shared evaluation harness, and data-vs-architecture-vs-algorithm controls. These gaps are a coordination failure, not a fault of any individual lab; the authors of this paper, like many others in the GFM community, have contributed to them. Rather than just critiquing the community, we aim to provide concrete steps toward a shared understanding of how to innovate GFMs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper audits 152 papers on geospatial foundation models (GFMs) and reports 46 cross-paper disagreements of at least 10 points on identical model-benchmark-protocol triples, 94/126 unique pretraining configurations, and 39% of papers releasing no weights. It concludes that these inconsistencies mean the community cannot determine the state of the art and proposes six concrete expectations (named-license weight release, shared core evaluations, copied-versus-rerun baselines, variance reporting, one shared harness, and data-vs-architecture-vs-algorithm controls) to remedy the coordination failure.

Significance. If the audit statistics are representative, the work identifies a systemic barrier to progress in a field with high-stakes applications. The constructive framing that credits the community (including the authors themselves) for the gaps, together with the explicit list of six expectations, gives the manuscript practical value beyond critique.

major comments (2)
  1. [Abstract and audit description] The central counts (46 disagreements, 94/126 unique configs, 39% no weights) rest on an audit whose paper-selection criteria, exact disagreement-measurement protocol, and inter-annotator agreement are not reported. These omissions are load-bearing for the claim that the observed inconsistencies prevent SOTA determination across the GFM literature.
  2. [Audit description] No formal sampling frame or justification is given for why the 152-paper corpus is representative of the full GFM literature. Without this, the generalization from the observed 46 disagreements and 94 unique pretraining setups to the conclusion that “nobody knows” the SOTA remains under-supported.
minor comments (1)
  1. [Proposal section] The six proposed expectations are listed clearly but would benefit from a short table or bullet list that maps each expectation to the specific audit finding it addresses.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation for minor revision. We address the two major comments below and will revise the manuscript to provide the requested methodological details and justifications.

read point-by-point responses
  1. Referee: [Abstract and audit description] The central counts (46 disagreements, 94/126 unique configs, 39% no weights) rest on an audit whose paper-selection criteria, exact disagreement-measurement protocol, and inter-annotator agreement are not reported. These omissions are load-bearing for the claim that the observed inconsistencies prevent SOTA determination across the GFM literature.

    Authors: We agree that these details are essential for transparency and to support the central claims. In the revised manuscript we will add a dedicated 'Audit Methodology' subsection that specifies: the exact paper-selection criteria (search terms, databases, date range, and inclusion/exclusion rules); the precise protocol used to identify and count disagreements (how model-benchmark-protocol triples were matched and the 10-point threshold applied); and any inter-annotator agreement measures or validation steps employed during data extraction. These additions will make the audit reproducible and directly address the load-bearing concern. revision: yes

  2. Referee: [Audit description] No formal sampling frame or justification is given for why the 152-paper corpus is representative of the full GFM literature. Without this, the generalization from the observed 46 disagreements and 94 unique pretraining setups to the conclusion that “nobody knows” the SOTA remains under-supported.

    Authors: We acknowledge that a formal sampling frame and explicit justification would strengthen the generalization. In revision we will insert a paragraph describing the systematic search strategy (keywords, sources, and temporal scope), the rationale for the 152-paper corpus, and a limitations discussion noting that the sample is not exhaustive but captures the dominant publication trends in the field. We will also qualify the 'nobody knows' conclusion to reflect that the observed inconsistencies are indicative rather than a complete census, while maintaining that they demonstrate a systemic coordination failure. revision: yes
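
The inter-annotator agreement promised in the first response is conventionally reported as Cohen's kappa over independently extracted labels. A minimal sketch, assuming two annotators tagged the same papers on a binary audit field (the labels here are hypothetical):

```python
from sklearn.metrics import cohen_kappa_score

# Hypothetical per-paper labels from two independent annotators, e.g. whether
# each audited paper releases model weights ("yes"/"no").
annotator_a = ["yes", "no", "yes", "yes", "no", "yes"]
annotator_b = ["yes", "no", "no", "yes", "no", "yes"]

kappa = cohen_kappa_score(annotator_a, annotator_b)
print(f"Cohen's kappa: {kappa:.2f}")  # 1.0 = perfect agreement, 0.0 = chance level
```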

Circularity Check

0 steps flagged

Empirical audit with no derivations or self-referential reductions

full rationale

The paper performs a direct count-based audit of 152 existing GFM papers, reporting observed frequencies of disagreements, unique pretraining configs, and weight-release rates. No equations, fitted parameters, predictions, or uniqueness theorems are invoked. The central claim follows immediately from the tabulated audit statistics without any intermediate derivation that could reduce to the inputs by construction. Self-mention of community contributions is incidental and not load-bearing.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that the audited sample captures the field's evaluation practices and that standardization is the primary missing ingredient for determining state of the art.

axioms (1)
  • domain assumption The 152-paper corpus is a representative sample of published GFM work
    The audit conclusions depend on this sample being broad enough to support the claim that no one knows the state of the art.

pith-pipeline@v0.9.0 · 5587 in / 1124 out tokens · 30555 ms · 2026-05-14T21:01:58.910678+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

80 extracted references · 80 canonical work pages · 4 internal anchors

  1. [1]

    Omnisat: Self-supervised modality fusion for earth observation

    Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. In European Conference on Computer Vision, pages 409–427. Springer, 2024

  2. [2]

    Anysat: One earth observation model for many resolutions, scales, and modalities

    Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Anysat: One earth observation model for many resolutions, scales, and modalities. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 19530–19540, 2025

  3. [3]

    Satlaspretrain: A large-scale dataset for remote sensing image understanding

    Favyen Bastani, Piper Wolters, Ritwik Gupta, Joe Ferdinando, and Aniruddha Kembhavi. Satlaspretrain: A large-scale dataset for remote sensing image understanding. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16772–16782, 2023

  4. [4]

    Olmoearth: Stable latent image modeling for multimodal earth observation

    Favyen Bastani et al. Olmoearth: Stable latent image modeling for multimodal earth observation. arXiv preprint, 2025

  5. [5]

    On the Opportunities and Risks of Foundation Models

    Rishi Bommasani, Drew A. Hudson, Ehsan Adeli, et al. On the opportunities and risks of foundation models. arXiv preprint arXiv:2108.07258, 2021

  6. [6]

    FoMo: Multi-modal, multi-scale and multi-task remote sensing foundation models for forest monitoring

    Nikolaos Ioannis Bountos, Arthur Ouaknine, Ioannis Papoutsis, and David Rolnick. FoMo: Multi-modal, multi-scale and multi-task remote sensing foundation models for forest monitoring. In AAAI Conference on Artificial Intelligence, pages 27858–27868, 2025. doi: 10.1609/aaai.v39i27.35002

  7. [7]

    Unreproducible research is reproducible

    Xavier Bouthillier, César Laurent, and Pascal Vincent. Unreproducible research is reproducible. In International Conference on Machine Learning, pages 725–734. PMLR, 2019

  8. [8]

    Accounting for variance in machine learning benchmarks

    Xavier Bouthillier, Pierre Delaunay, Mirko Bronzi, et al. Accounting for variance in machine learning benchmarks. In MLSys, 2021

  9. [9]

    Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data

    Christopher F Brown, Michal R Kazmierski, Valerie J Pasquarella, William J Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, et al. Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data. arXiv preprint arXiv:2507.22291, 2025

  10. [10]

    Language models are few-shot learners

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. Language models are few-shot learners. Advances in neural information processing systems, 33:1877–1901, 2020

  11. [11]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9650–9660, 2021

  12. [12]

    Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts

    Soravit Changpinyo, Piyush Sharma, Nan Ding, and Radu Soricut. Conceptual 12m: Pushing web-scale image-text pre-training to recognize long-tail visual concepts. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 3558–3568, 2021

  13. [13]

    Remote sensing image scene classification: Benchmark and state of the art

    Gong Cheng, Junwei Han, and Xiaoqiang Lu. Remote sensing image scene classification: Benchmark and state of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017

  14. [14]

    Functional map of the world

    Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018

  15. [15]

    Revisiting pre-trained remote sensing model benchmarks: Resizing and normalization matters

    Isaac Corley, Caleb Robinson, and Anthony Ortiz. Revisiting pre-trained remote sensing model benchmarks: Resizing and normalization matters. In IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pages 3162–3172, 2024. doi: 10.1109/CVPRW63382.2024.00322

  16. [16]

    Terrafm: A scalable foundation model for unified multisensor earth observation

    Muhammad Sohail Danish, Muhammad Akhtar Munir, Syed Roshaan Ali Shah, Muhammad Haris Khan, Rao Muhammad Anwer, Jorma Laaksonen, Fahad Shahbaz Khan, and Salman Khan. Terrafm: A scalable foundation model for unified multisensor earth observation. arXiv preprint arXiv:2506.06281, 2025

  17. [17]

    The benchmark lottery

    Mostafa Dehghani, Yi Tay, Alexey A. Gritsenko, et al. The benchmark lottery. arXiv preprint arXiv:2107.07002, 2021

  18. [18]

    Imagenet: A large-scale hierarchical image database

    Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. IEEE, 2009

  19. [19]

    Bert: Pre-training of deep bidirectional transformers for language understanding

    Jacob Devlin, Ming-Wei Chang, Kenton Lee, and Kristina Toutanova. Bert: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers), pages 4171–4186, 2019

  20. [20]

    Data science at the singularity

    David Donoho. Data science at the singularity. Harvard Data Science Review, 6(1), 2024

  21. [21]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020

  22. [22]

    Phileo bench: Evaluating geo-spatial foundation models

    Casper Fibaek, Luke Camilleri, Andreas Luyts, Nikolaos Dionelis, and Bertrand Le Saux. Phileo bench: Evaluating geo-spatial foundation models. In IEEE International Geoscience and Remote Sensing Symposium (IGARSS), 2024

  23. [23]

    Open LLM leaderboard v2

    Clémentine Fourrier, Nathan Habib, Alina Lozovskaya, Konrad Hynek, and Thomas Wolf. Open LLM leaderboard v2, 2024. https://huggingface.co/spaces/open-llm-leaderboard/open_llm_leaderboard

  24. [24]

    Major tom: Expandable datasets for earth observation

    Alistair Francis and Mikolaj Czerkawski. Major tom: Expandable datasets for earth observation. In IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, pages 2935–2940. IEEE, 2024

  25. [25]

    Bad tables: Why you shouldn’t trust results tables in remote-sensing foundation model papers

    Anthony Fuller. Bad tables: Why you shouldn’t trust results tables in remote-sensing foundation model papers, 2026. URL https://antofuller.github.io/BAD_TABLES.pdf. Talk, ICLR Machine Learning for Remote Sensing Workshop, April 2026

  26. [26]

    Croma: Remote sensing representations with contrastive radar-optical masked autoencoders

    Anthony Fuller, Koreen Millard, and James Green. Croma: Remote sensing representations with contrastive radar-optical masked autoencoders. Advances in Neural Information Processing Systems, 36:5506–5538, 2023

  27. [27]

    A framework for few-shot language model evaluation

    Leo Gao, Jonathan Tow, Baber Abbasi, et al. A framework for few-shot language model evaluation. Zenodo, 2024. lm-evaluation-harness

  28. [28]

    Flair: a country-scale land cover semantic segmentation dataset from multi-source optical imagery

    Anatol Garioud, Nicolas Gonthier, Loic Landrieu, Apolline De Wit, Marion Valette, Marc Poupée, Sébastien Giordano, et al. Flair: a country-scale land cover semantic segmentation dataset from multi-source optical imagery. Advances in Neural Information Processing Systems, 36:16456–16482, 2023

  29. [29]

    Terratorch: The geospatial foundation models toolkit

    Carlos Gomes, Benedikt Blumenstiel, Joao Lucas De Sousa Almeida, Pedro Henrique De Oliveira, Paolo Fraccaro, Francesc Marti Escofet, Daniela Szwarcman, Naomi Simumba, Romeo Kienzler, and Bianca Zadrozny. Terratorch: The geospatial foundation models toolkit. In IGARSS 2025-2025 IEEE International Geoscience and Remote Sensing Symposium, pages 6364–6368. IEEE, 2025

  30. [30]

    Deep residual learning for image recognition

    Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 770–778, 2016

  31. [31]

    Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification

    Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019

  32. [32]

    Ringmo-agent: A unified remote sensing foundation model for multi-platform and multi-modal reasoning

    Huiyang Hu, Peijin Wang, Yingchao Feng, Kaiwen Wei, Wenxin Yin, Wenhui Diao, Mengyu Wang, Hanbo Bi, Kaiyue Kang, Tong Ling, et al. Ringmo-agent: A unified remote sensing foundation model for multi-platform and multi-modal reasoning. arXiv preprint arXiv:2507.20776, 2025

  33. [33]

    Mdas: A new multimodal benchmark dataset for remote sensing

    Jingliang Hu, Rong Liu, Danfeng Hong, Andrés Camero, Jing Yao, Mathias Schneider, Franz Kurz, Karl Segl, and Xiao Xiang Zhu. Mdas: A new multimodal benchmark dataset for remote sensing. Earth System Science Data, 15(1):113–131, 2023

  34. [34]

    Generic knowledge boosted pretraining for remote sensing images

    Ziyue Huang, Mingming Zhang, Yuan Gong, Qingjie Liu, and Yunhong Wang. Generic knowledge boosted pretraining for remote sensing images. IEEE Transactions on Geoscience and Remote Sensing, 62:1–13, 2024

  35. [35]

    Can generative geospatial diffusion models excel as discriminative geospatial foundation models?

    Yuru Jia, Valerio Marsocci, Ziyang Gong, Xue Yang, Maarten Vergauwen, and Andrea Nascetti. Can generative geospatial diffusion models excel as discriminative geospatial foundation models? In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 8429–8440, 2025

  36. [36]

    Spatial dependence between training and test sets: another pitfall of classification accuracy assessment in remote sensing

    Nicolas Karasiak, Jean-François Dejoux, Claude Monteil, and David Sheeren. Spatial dependence between training and test sets: another pitfall of classification accuracy assessment in remote sensing. Machine Learning, 111:2715–2740, 2022. doi: 10.1007/s10994-021-05972-1

  37. [37]

    Spatially autocorrelated training and validation samples inflate performance assessment of convolutional neural networks

    Teja Kattenborn, Felix Schiefer, Julian Frey, Hannes Feilhauer, Miguel D. Mahecha, and Carsten F. Dormann. Spatially autocorrelated training and validation samples inflate performance assessment of convolutional neural networks. ISPRS Open Journal of Photogrammetry and Remote Sensing, 5:100018, 2022. doi: 10.1016/j.ophoto.2022.100018

  38. [38]

    Pretrain Where? Investigating How Pretraining Data Diversity Impacts Geospatial Foundation Model Performance

    Amandeep Kaur, Mirali Purohit, Gedeon Muhawenayo, Esther Rolf, and Hannah Kerner. Pretrain where? investigating how pretraining data diversity impacts geospatial foundation model performance. arXiv preprint arXiv:2604.21104, 2026

  39. [39]

    Segment anything

    Alexander Kirillov, Eric Mintun, Nikhila Ravi, Hanzi Mao, Chloe Rolland, Laura Gustafson, Tete Xiao, Spencer Whitehead, Alexander C Berg, Wan-Yen Lo, et al. Segment anything. In Proceedings of the IEEE/CVF international conference on computer vision, pages 4015–4026, 2023

  40. [40]

    Reduced, reused and recycled: The life of a dataset in machine learning research

    Bernard Koch, Emily Denton, Alex Hanna, and Jacob G. Foster. Reduced, reused and recycled: The life of a dataset in machine learning research. In NeurIPS Datasets and Benchmarks, 2021

  41. [41]

    GEO-Bench: Toward foundation models for earth monitoring

    Alexandre Lacoste, Nils Lehmann, Pau Rodríguez Castaño, et al. GEO-Bench: Toward foundation models for earth monitoring. In NeurIPS Datasets and Benchmarks, 2023

  42. [42]

    Geo-bench: Toward foundation models for earth monitoring

    Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan Sherwin, Hannah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. Geo-bench: Toward foundation models for earth monitoring. Advances in Neural Information Processing Systems, 36:51080–51093, 2023

  43. [43]

    Object detection in optical remote sensing images: A survey and a new benchmark

    Ke Li, Gang Wan, Gong Cheng, Liqiu Meng, and Junwei Han. Object detection in optical remote sensing images: A survey and a new benchmark. ISPRS journal of photogrammetry and remote sensing, 159:296–307, 2020

  44. [44]

    Masked angle-aware autoencoder for remote sensing images

    Zhihao Li, Biao Hou, Siteng Ma, Zitong Wu, Xianpeng Guo, Bo Ren, and Licheng Jiao. Masked angle-aware autoencoder for remote sensing images. In European Conference on Computer Vision, pages 260–278. Springer, 2024

  45. [45]

    Holistic evaluation of language models

    Percy Liang, Rishi Bommasani, Tony Lee, et al. Holistic evaluation of language models. Transactions on Machine Learning Research, 2023

  46. [46]

    Troubling trends in machine learning scholarship: Some ml papers suffer from flaws that could mislead the public and stymie future research

    Zachary C Lipton and Jacob Steinhardt. Troubling trends in machine learning scholarship: Some ml papers suffer from flaws that could mislead the public and stymie future research. Queue, 17(1):45–77, 2019

  47. [47]

    Docling: An efficient open-source toolkit for AI-driven document conversion

    Nikolaos Livathinos, Christoph Auer, Maksym Lysak, Ahmed Nassar, Michele Dolfi, Panos Vagenas, Cesar Berrospi Ramis, Matteo Omenetti, Kasper Dinkla, Yusik Kim, et al. Docling: An efficient open-source toolkit for AI-driven document conversion. arXiv preprint arXiv:2501.17887, 2025

  48. [48]

    On creating benchmark dataset for aerial image interpretation: Reviews, guidances and million-aid

    Yang Long, Gui-Song Xia, Shengyang Li, Wen Yang, Michael Ying Yang, Xiao Xiang Zhu, Liangpei Zhang, and Deren Li. On creating benchmark dataset for aerial image interpretation: Reviews, guidances and million-aid. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 14:4205–4230, 2021

  49. [49]

    Vision foundation models in remote sensing: A survey

    Siqi Lu, Junlin Guo, James R. Zimmer-Dauphinee, et al. Vision foundation models in remote sensing: A survey. IEEE Geoscience and Remote Sensing Magazine, 2024

  50. [50]

    Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data

    Oscar Manas, Alexandre Lacoste, Xavier Giró-i Nieto, David Vazquez, and Pau Rodriguez. Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data. In Proceedings of the IEEE/CVF international conference on computer vision, pages 9414–9423, 2021

  51. [51]

    PANGAEA: A global and inclusive benchmark for geospatial foundation models

    Valerio Marsocci, Yuru Jia, Gilles Le Bellier, et al. PANGAEA: A global and inclusive benchmark for geospatial foundation models. arXiv preprint arXiv:2412.04204, 2024

  52. [52]

    Towards geospatial foundation models via continual pretraining

    Matías Mendieta, Boran Han, Xingjian Shi, Yi Zhu, and Chen Chen. Towards geospatial foundation models via continual pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16806–16816, 2023

  53. [53]

    Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning

    Vishal Nedungadi, Ankit Kariryaa, Stefan Oehmcke, Serge Belongie, Christian Igel, and Nico Lang. Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning. In European Conference on Computer Vision, pages 164–182. Springer, 2024

  54. [54]

    Mapping global dynamics of benchmark creation and saturation in artificial intelligence

    Simon Ott, Adriano Barbosa-Silva, Kathrin Blagec, Jan Brauner, and Matthias Samwald. Mapping global dynamics of benchmark creation and saturation in artificial intelligence. Nature Communications, 2022

  55. [55]

    Planted: a dataset for planted forest identification from multi-satellite time series

    Luis Miguel Pazos-Outón, Cristina Nader Vasconcelos, Anton Raichuk, Anurag Arnab, Dan Morris, and Maxim Neumann. Planted: a dataset for planted forest identification from multi-satellite time series. In IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium, pages 7066–7070. IEEE, 2024

  56. [56]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021

  57. [57]

    Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

    Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023

  58. [58]

    Position: Mission critical – satellite data is a distinct modality in machine learning

    Esther Rolf, Konstantin Klemmer, Caleb Robinson, and Hannah Kerner. Position: Mission critical – satellite data is a distinct modality in machine learning. ICML, 2024

  59. [59]

    SEN12MS -- A Curated Dataset of Georeferenced Multi-Spectral Sentinel-1/2 Imagery for Deep Learning and Data Fusion

    Michael Schmitt, Lloyd Haydn Hughes, Chunping Qiu, and Xiao Xiang Zhu. Sen12ms – a curated dataset of georeferenced multi-spectral sentinel-1/2 imagery for deep learning and data fusion. arXiv preprint arXiv:1906.07789, 2019

  60. [60]

    Laion-5b: An open large-scale dataset for training next generation image-text models

    Christoph Schuhmann, Romain Beaumont, Richard Vencu, Cade Gordon, Ross Wightman, Mehdi Cherti, Theo Coombes, Aarush Katta, Clayton Mullis, Mitchell Wortsman, et al. Laion-5b: An open large-scale dataset for training next generation image-text models. Advances in neural information processing systems, 35:25278–25294, 2022

  61. [61]

    Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning

    Piyush Sharma, Nan Ding, Sebastian Goodman, and Radu Soricut. Conceptual captions: A cleaned, hypernymed, image alt-text dataset for automatic image captioning. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 2556–2565. Association for Computational Linguistics, 2018

  62. [62]

    Geo-bench-2: From performance to capability, rethinking evaluation in geospatial ai

    Naomi Simumba, Nils Lehmann, Paolo Fraccaro, Hamed Alemohammad, Geeth De Mel, Salman Khan, Manil Maskey, Nicolas Longepe, Xiao Xiang Zhu, Hannah Kerner, et al. Geo-bench-2: From performance to capability, rethinking evaluation in geospatial ai. arXiv preprint arXiv:2511.15658, 2025

  63. [63]

    Earthdial: Turning multi-sensory earth observations to interactive dialogues

    Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, et al. Earthdial: Turning multi-sensory earth observations to interactive dialogues. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 14303–14313, 2025

  64. [64]

    Beyond the imitation game: Quantifying and extrapolating the capabilities of language models

    Aarohi Srivastava, Abhinav Rastogi, Abhishek Rao, et al. Beyond the imitation game: Quantifying and extrapolating the capabilities of language models. Transactions on Machine Learning Research, 2023

  65. [65]

    Torchgeo: deep learning with geospatial data

    Adam J Stewart, Caleb Robinson, Isaac A Corley, Anthony Ortiz, Juan M Lavista Ferres, and Arindam Banerjee. Torchgeo: deep learning with geospatial data. ACM Transactions on Spatial Algorithms and Systems, 11(4):1–28, 2025

  66. [66]

    Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery

    Xian Sun, Peijin Wang, Zhiyuan Yan, Feng Xu, Ruiping Wang, Wenhui Diao, Jin Chen, Jihao Li, Yingchao Feng, Tao Xu, et al. Fair1m: A benchmark dataset for fine-grained object recognition in high-resolution remote sensing imagery. ISPRS Journal of Photogrammetry and Remote Sensing, 184:116–130, 2022

  67. [67]

    Tov: The original vision model for optical remote sensing image understanding via self-supervised learning

    Chao Tao, Ji Qi, Guo Zhang, Qing Zhu, Weipeng Lu, and Haifeng Li. Tov: The original vision model for optical remote sensing image understanding via self-supervised learning. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 16:4916–4930, 2023

  68. [68]

    Galileo: Learning global & local features of many remote sensing modalities

    Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, and Hannah Kerner. Galileo: Learning global & local features of many remote sensing modalities. In Proceedings of the International Conference on Machine Learning, 2025

  69. [69]

    Panopticon: Advancing any-sensor foundation models for earth observation

    Leonard Waldmann, Ando Shah, Yi Wang, Nils Lehmann, Adam Stewart, Zhitong Xiong, Xiao Xiang Zhu, Stefan Bauer, and John Chuang. Panopticon: Advancing any-sensor foundation models for earth observation. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2204–2214, 2025

  70. [70]

    Harnessing massive satellite imagery with efficient masked image modeling

    Fengxiang Wang, Hongzhen Wang, Di Wang, Zonghao Guo, Zhenyu Zhong, Long Lan, Wenjing Yang, and Jing Zhang. Harnessing massive satellite imagery with efficient masked image modeling. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 6935–6947, 2025

  71. [71]

    Ssl4eo-s12: A large-scale multimodal, multitemporal dataset for self-supervised learning in earth observation

    Yi Wang, Nassim Ait Ali Braham, Zhitong Xiong, Chenying Liu, Conrad M Albrecht, and Xiao Xiang Zhu. Ssl4eo-s12: A large-scale multimodal, multitemporal dataset for self-supervised learning in earth observation [software and data sets]. IEEE Geoscience and Remote Sensing Magazine, 11(3):98–106, 2023

  72. [72]

    Aid: A benchmark data set for performance evaluation of aerial scene classification

    Gui-Song Xia, Jingwen Hu, Fan Hu, Baoguang Shi, Xiang Bai, Yanfei Zhong, Liangpei Zhang, and Xiaoqiang Lu. Aid: A benchmark data set for performance evaluation of aerial scene classification. IEEE Transactions on Geoscience and Remote Sensing, 55(7):3965–3981, 2017

  73. [73]

    Dota: A large-scale dataset for object detection in aerial images

    Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3974–3983, 2018

  74. [74]

    Foundation models for remote sensing and earth observation: A survey

    Aoran Xiao, Weihao Xuan, Junjue Wang, et al. Foundation models for remote sensing and earth observation: A survey. IEEE Geoscience and Remote Sensing Magazine, 2025

  75. [75]

    Neural plasticity-inspired multimodal foundation model for earth observation

    Zhitong Xiong, Yi Wang, Fahong Zhang, Adam J Stewart, Joelle Hanna, Damian Borth, Ioannis Papoutsis, Bertrand Le Saux, Gustau Camps-Valls, and Xiao Xiang Zhu. Neural plasticity-inspired multimodal foundation model for earth observation. arXiv preprint arXiv:2403.15356, 2024

  76. [76]

    Bag-of-visual-words and spatial extensions for land-use classification

    Yi Yang and Shawn Newsam. Bag-of-visual-words and spatial extensions for land-use classification. In ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems (ACM GIS), pages 270–279, 2010

  77. [77]

    A large-scale study of representation learning with the visual task adaptation benchmark

    Xiaohua Zhai, Joan Puigcerver, Alexander Kolesnikov, Pierre Ruyssen, Carlos Riquelme, Mario Lucic, Josip Djolonga, Andre Susano Pinto, Maxim Neumann, Alexey Dosovitskiy, et al. A large-scale study of representation learning with the visual task adaptation benchmark. arXiv preprint arXiv:1910.04867, 2019

  78. [78]

    Ctxmim: Context-enhanced masked image modeling for remote sensing image understanding

    Mingming Zhang, Qingjie Liu, and Yunhong Wang. Ctxmim: Context-enhanced masked image modeling for remote sensing image understanding. ACM Transactions on Multimedia Computing, Communications and Applications, 21(12):1–22, 2025

  79. [79]

    Earthgpt: A universal multimodal large language model for multisensor image comprehension in remote sensing domain

    Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multimodal large language model for multisensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 62:1–20, 2024

  80. [80]

    Rs5m and georsclip: A large-scale vision-language dataset and a large vision-language model for remote sensing

    Zilun Zhang, Tiancheng Zhao, Yulong Guo, and Jianwei Yin. Rs5m and georsclip: A large-scale vision-language dataset and a large vision-language model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62:1–23, 2024