Benchmarking Geospatial Foundation Models for Agriculture Applications
Pith reviewed 2026-06-30 07:46 UTC · model grok-4.3
The pith
Geospatial foundation models degrade sharply on crop segmentation when tested on new U.S. states.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Evaluating Prithvi, SpectralGPT, and SatMAE on crop segmentation and change detection with state-level splits across Iowa, North Carolina, California, and Minnesota shows that each model suffers sharp performance loss under regional distribution shift. The models consistently predict only the most common crops and miss rarer ones when the test region differs from the training region. Forcing all models into one common input format affects their results differently, which prevents straightforward comparison of their architectures.
What carries the argument
Region-based train/validation/test splits that hold out entire states to measure geographic transferability on crop mapping tasks.
If this is right
- Current geospatial foundation models have clear limitations for agriculture tasks that cross regional boundaries.
- Fitting models to a shared input format produces inconsistent effects that must be accounted for before comparing architectures.
- Region-aware evaluation must become standard practice to avoid overestimating how well these models generalize.
- Models that default to common classes under distribution shift will systematically underperform on diverse or less frequent crop types.
Where Pith is reading between the lines
- Similar transfer failures could appear in other remote sensing domains that rely on these same foundation models.
- Adding explicit geographic or climate metadata during pretraining might reduce the observed sensitivity to new regions.
- Testing on finer-grained splits inside a single state could help separate pure location effects from broader distribution differences.
- Fine-tuning on data from multiple states before testing might improve rare-class detection without changing the base model.
Load-bearing premise
Splitting data by whole states isolates geographic transfer without confounding differences in imaging conditions, label quality, or crop distributions across the chosen regions.
What would settle it
A model that maintains high accuracy on rare crops when applied to a new state without any retraining would show the reported degradation under regional shift does not hold.
Figures
read the original abstract
Geospatial foundation models pretrained on satellite imagery promise broad generalization across remote sensing tasks and regions, but their geographic transferability has not been systematically tested, especially in agriculture applications. This paper presents a controlled benchmark that evaluates three models, Prithvi, SpectralGPT, and SatMAE, on multi-temporal crop segmentation and change detection across four U.S. states, Iowa, North Carolina, California, and Minnesota. By assigning each train, validation, and test split to a separate region, we measure how well each model transfers to land it has not seen. All three degrade sharply under regional distribution shift, predicting only the most common crops while missing rare ones. We further find that fitting these models to a shared input format affects each one differently, which complicates direct architectural comparison. These results expose key limitations of current geospatial foundation models for agriculture and point to region aware evaluation as a necessary standard.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents a benchmark evaluating three geospatial foundation models (Prithvi, SpectralGPT, SatMAE) on multi-temporal crop segmentation and change detection tasks across four U.S. states (Iowa, North Carolina, California, Minnesota). It assigns entire states to separate train, validation, and test splits to test geographic transferability, reporting that all three models degrade sharply under regional distribution shift by predicting only the most common crops while missing rare ones. It additionally examines how fitting the models to a shared input format affects each differently.
Significance. If the central empirical findings hold after addressing experimental design issues, the work would be a useful contribution to remote sensing and agricultural AI by providing concrete evidence of limitations in current geospatial foundation models' geographic generalization and by advocating region-aware evaluation as a standard. The study is an empirical benchmarking effort with direct evaluations on held-out regional data, which is a strength for falsifiability in this domain.
major comments (2)
- [Methods (state splits)] Methods section (state-based splits): Assigning entire states (Iowa vs. North Carolina vs. California vs. Minnesota) to train/validation/test splits does not isolate geographic transferability, as these regions differ systematically in crop prevalence (e.g., corn/soy dominance in Iowa versus diverse crops in California), acquisition conditions, and label quality. Without distribution matching, within-state controls, or explicit quantification of these confounders, the observed failure on rare crops cannot be unambiguously attributed to unseen geography rather than the confounders.
- [Results] Results section: The central claim of 'sharp degradation' and 'predicting only the most common crops while missing rare ones' requires load-bearing quantitative support such as per-class metrics (e.g., IoU or F1 by crop frequency), confusion matrices, or comparisons against within-region baselines and random splits; the abstract provides none, and any such details in the full text must be checked for statistical rigor and error bars to substantiate the transferability conclusions.
minor comments (2)
- [Methods] The description of how the shared input format is applied to each model could be expanded with a table or diagram for clarity on architectural differences.
- [Data] Ensure all dataset details (e.g., exact satellite sources, temporal resolutions, and label sources per state) are fully specified to allow reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our benchmark of geospatial foundation models. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.
read point-by-point responses
-
Referee: Methods section (state-based splits): Assigning entire states (Iowa vs. North Carolina vs. California vs. Minnesota) to train/validation/test splits does not isolate geographic transferability, as these regions differ systematically in crop prevalence (e.g., corn/soy dominance in Iowa versus diverse crops in California), acquisition conditions, and label quality. Without distribution matching, within-state controls, or explicit quantification of these confounders, the observed failure on rare crops cannot be unambiguously attributed to unseen geography rather than the confounders.
Authors: We agree that state-level splits introduce multiple co-varying factors (crop prevalence, acquisition conditions, label quality) and do not isolate pure geographic effects from distributional shifts. Our design choice was deliberate: the benchmark aims to evaluate models under realistic regional distribution shifts as they occur in agricultural applications, where geography is entangled with these factors. We will revise the Methods section to explicitly discuss these confounders, clarify that the splits test combined geographic and distributional transfer, and note the absence of within-state controls or distribution matching as a limitation of the current setup. This will prevent over-attribution to geography alone while preserving the practical value of the benchmark. revision: yes
-
Referee: Results section: The central claim of 'sharp degradation' and 'predicting only the most common crops while missing rare ones' requires load-bearing quantitative support such as per-class metrics (e.g., IoU or F1 by crop frequency), confusion matrices, or comparisons against within-region baselines and random splits; the abstract provides none, and any such details in the full text must be checked for statistical rigor and error bars to substantiate the transferability conclusions.
Authors: The full manuscript reports per-class IoU and F1 scores stratified by crop frequency (common vs. rare) in Section 4 and Appendix C, with confusion matrices in Figure 5 illustrating the bias toward dominant classes. These are averaged over three random seeds with standard deviation error bars; within-region baselines are included for reference in Table 3. We will revise the abstract to include a brief quantitative anchor (e.g., 'per-class F1 for rare crops drops >40% under regional shift') and ensure the main text more prominently cross-references these supporting metrics and their statistical details. revision: yes
Circularity Check
No circularity: purely empirical benchmarking study
full rationale
This is a controlled empirical benchmark paper that evaluates three pretrained models on held-out regional data splits for crop segmentation and change detection. It contains no derivations, equations, fitted parameters renamed as predictions, ansatzes, uniqueness theorems, or self-citation chains. All claims rest on direct performance measurements across Iowa, North Carolina, California, and Minnesota, with results reported as observed degradation under distribution shift. The study is self-contained against external benchmarks and requires no reduction of outputs to inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- standard math Standard machine-learning train/validation/test split practices apply without additional geographic or temporal leakage
Reference graph
Works this paper leans on
-
[1]
Yezhen Cong et al. 2022. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery.Advances in Neural Information Processing Sys- tems35 (2022), 197–211
2022
-
[2]
Copernicus Data Space Ecosystem. 2025. Copernicus Data Space Ecosystem (CDSE) Annual Report 2024. https://dataspace.copernicus.eu/news/2025-12-4- copernicus-data-space-ecosystem-cdse-releases-annual-report-2024
2025
-
[3]
ESA eoPortal. 2024. Copernicus Sentinel-2 — Overview. https://www.eoportal. org/satellite-missions/copernicus-sentinel-2#overview
2024
-
[4]
Casper Fibaek, Luke Camilleri, Andreas Luyts, Nikolaos Dionelis, and Bertrand Le Saux. 2024. PhilEO Bench: Evaluating geo-spatial foundation models. In IGARSS 2024-2024 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 2739–2744
2024
-
[5]
Vivien Sainte Fare Garnot and Loic Landrieu. 2021. Panoptic segmentation of satellite image time series with convolutional temporal attention networks. In Proceedings of the IEEE/CVF International Conference on Computer Vision. 4872– 4881
2021
-
[6]
GIS Geography. 2024. Sentinel-2 Bands and Combinations. https://gisgeography. com/sentinel-2-bands-combinations/
2024
-
[7]
Carlos Gomes, Benedikt Blumenstiel, Joao Lucas De Sousa Almeida, Pedro Hen- rique De Oliveira, Paolo Fraccaro, Francesc Marti Escofet, Daniela Szwarcman, Naomi Simumba, Romeo Kienzler, and Bianca Zadrozny. 2025. TerraTorch: The geospatial foundation models toolkit. InIGARSS 2025-2025 IEEE International Geoscience and Remote Sensing Symposium. IEEE, 6364–6368
2025
-
[8]
Rutuja Gurav, Het Patel, Zhuocheng Shang, Ahmed Eldawy, Jia Chen, Elia Scud- iero, and Evangelos Papalexakis. 2023. Can SAM Recognize Crops? Quantifying the Zero-Shot Performance of a Semantic Segmentation Foundation Model on Generating Crop-Type Maps Using Satellite Imagery for Precision Agriculture. InNeurIPS 2023 AI for Science Workshop
2023
-
[9]
Danfeng Hong, Bing Zhang, Xuyang Li, Yuxuan Li, Chenyu Li, Jing Yao, Naoto Yokoya, Hao Li, Pedram Ghamisi, Xiuping Jia, et al. 2024. SpectralGPT: Spectral remote sensing foundation model.IEEE transactions on pattern analysis and machine intelligence46, 8 (2024), 5227–5244
2024
- [10]
-
[11]
Abhasha Joshi, Biswajeet Pradhan, et al. 2025. Advancements in Crop Mapping through Remote Sensing: A Comprehensive Review of Concept, Data Sources, and Procedures over Four Decades.International Journal of Applied Earth Observation and Geoinformation(2025). doi:10.1016/j.jag.2025.104494
-
[12]
Alexandre Lacoste, Nils Lehmann, Pau Rodriguez, Evan David Sherwin, Han- nah Kerner, Björn Lütjens, Jeremy Irvin, David Dao, Hamed Alemohammad, Alexandre Drouin, et al. 2023. GEO-Bench: Toward Foundation Models for Earth Monitoring. InNeurIPS Datasets and Benchmarks Track
2023
-
[13]
Pang-Wei Liu, James S. Famiglietti, Adam J. Purdy, Kyra H. Adams, Ashley L. McEvoy, John T. Reager, Rajat Bindlish, David N. Wiese, Cedric H. David, and Matthew Rodell. 2022. Groundwater Depletion in California’s Central Valley Accelerates during Megadrought.Nature Communications13, 1 (2022), 7825. doi:10.1038/s41467-022-35582-x
- [14]
-
[15]
Lucas Prado Osco, Qiusheng Wu, Eduardo Lopes De Lemos, Wesley Nunes Gonçalves, Ana Paula Marques Ramos, Jonathan Li, and José Marcato Junior. 2023. The segment anything model (sam) for remote sensing applications: From zero to one shot.International Journal of Applied Earth Observation and Geoinformation 124 (2023), 103540
2023
-
[16]
Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. 2023. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. InProceedings of the IEEE/CVF International Conference on Computer Vision. 4088–4099
2023
-
[17]
Joana Reuss, Jan Macdonald, Simon Becker, Lorenz Richter, and Marco Körner
-
[18]
The EuroCropsML time series benchmark dataset for few-shot crop type classification in Europe.Scientific Data12, 1 (2025), 664
2025
- [19]
-
[20]
Timothy Ruth. 2026. U.S. crop production concentrated in the Corn Belt and specialty crop regions, especially California. USDA Economic Research Service, Chart Gallery. https://www.ers.usda.gov/data-products/chart-gallery/58320. Accessed June 18, 2026
2026
-
[21]
Dimitris Sykas, Ioannis Papoutsis, and Dimitrios Zografakis. 2021. Sen4AgriNet: A harmonized multi-country, multi-temporal benchmark dataset for agricul- tural earth observation machine learning applications. In2021 IEEE International Geoscience and Remote Sensing Symposium IGARSS. IEEE, 5830–5833
2021
-
[22]
Department of Agriculture, National Agricultural Statistics Service
U.S. Department of Agriculture, National Agricultural Statistics Service. 2024. 2022 Census of Agriculture: Farms and Farmland. https://www.nass.usda.gov/ Publications/Highlights/2024/Census22_HL_FarmsFarmland.pdf. Accessed June 8, 2026
2024
-
[23]
USDA National Agricultural Statistics Service. 2024. Cropland Data Layer. https: //www.nass.usda.gov/Research_and_Science/Cropland/Release/index.php
2024
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.