Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa
Pith reviewed 2026-05-12 01:23 UTC · model grok-4.3
The pith
Frozen embeddings from geospatial foundation models provide no advantage over traditional spectral features for predicting maize yields across different countries.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
On 6,404 maize field observations from five countries, within-country random cross-validation produces moderate R² values, while leave-one-country-out evaluation yields negative R² for every tested feature set. Frozen Prithvi-EO and ViT-Base embeddings show no meaningful gain over engineered Sentinel-2 spectral features. The paper attributes the performance collapse primarily to shifts in yield distributions across countries rather than to representation quality.
What carries the argument
The leave-one-country-out cross-validation protocol that withholds all observations from one country for testing while training on the remaining four countries, applied to compare frozen foundation model embeddings against hand-engineered spectral features from Sentinel-2 imagery.
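The splitting logic of such a protocol can be sketched in a few lines. This is a minimal illustration rather than the paper's code: the function name `leave_one_country_out` and the toy country codes are assumptions made for the example.

```python
from collections import defaultdict

def leave_one_country_out(countries):
    """Yield (held_out, train_idx, test_idx) once per distinct country.

    `countries` holds one country label per observation. All rows from the
    held-out country form the test fold; every other row is used for training.
    """
    by_country = defaultdict(list)
    for i, c in enumerate(countries):
        by_country[c].append(i)
    for held_out in sorted(by_country):
        test_idx = by_country[held_out]
        train_idx = [i for i, c in enumerate(countries) if c != held_out]
        yield held_out, train_idx, test_idx

# Toy labels (hypothetical country codes, not the paper's data).
labels = ["A", "A", "B", "C", "B", "C", "C"]
folds = list(leave_one_country_out(labels))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", train_idx, "test:", test_idx)
```

No observation from the held-out country ever reaches the training fold, which is what separates this design from a random split. In practice, scikit-learn's `LeaveOneGroupOut` splitter produces the same kind of partition when the country label is passed as the group.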
If this is right
- Within-country random splits substantially overestimate the generalisability of yield prediction models to new countries.
- Engineered spectral features from Sentinel-2 perform at least as well as frozen Prithvi-EO embeddings under strict cross-country conditions.
- All current representation approaches fail to produce useful predictions when yield levels differ markedly between training and test countries.
- Releasing the negative benchmark enables future work to measure progress against a reproducible cross-country baseline.
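A small worked example may help show how a shift in yield level alone can push R² below zero. It uses the standard definition R² = 1 - SS_res / SS_tot, with SS_tot measured around the mean of the test labels. The yields and the 1 t/ha offset are invented for illustration and are not taken from the paper.

```python
def r2_score(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot

# Invented yields (t/ha) for a held-out country with a low yield level.
y_test = [0.8, 1.0, 1.2]
# Predictions that rank the fields perfectly but sit 1 t/ha too high,
# as could happen when the training countries have higher average yields.
y_pred = [t + 1.0 for t in y_test]

shifted_r2 = r2_score(y_test, y_pred)
print(shifted_r2 < 0)  # True: the offset dwarfs the small within-country variance
```

Under these assumed numbers the predictions order the fields correctly, yet the score is strongly negative because the squared offset is large compared with the spread of yields inside the held-out country.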
Where Pith is reading between the lines
- Explicit correction for country-level yield shifts, such as mean adjustment or domain adaptation layers, could be tested as a direct follow-up to isolate the effect.
- The same leave-one-country-out design applied to other regions or crops would reveal whether distribution shifts are a general obstacle in agricultural remote sensing.
- Fine-tuning the foundation model weights instead of keeping them frozen might alter the relative performance of embeddings versus spectral features.
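The mean adjustment mentioned in the first item could be tested along the following lines. This is only a sketch under a strong assumption: a few labelled fields from the held-out country are available for calibration, which goes beyond a strict leave-one-country-out setting. The function name `mean_adjust` and all numbers are hypothetical.

```python
def mean_adjust(preds, calib_preds, calib_yields):
    """Shift predictions by the mean residual observed on a calibration sample."""
    offset = sum(y - p for y, p in zip(calib_yields, calib_preds)) / len(calib_yields)
    return [p + offset for p in preds]

# Invented model predictions (t/ha) for five fields in the held-out country.
preds = [1.8, 2.0, 2.2, 1.9, 2.1]
# Assumption: ground-truth yields are known for the first two fields only.
calib_preds = preds[:2]
calib_yields = [0.8, 1.0]

# Adjust only the fields that were not used for calibration.
adjusted = mean_adjust(preds[2:], calib_preds, calib_yields)
print([round(a, 2) for a in adjusted])
```

Comparing R² before and after such a shift, on fields excluded from calibration, would indicate how much of the cross-country gap is a pure level offset.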
Load-bearing premise
The assumption that the observed collapse in cross-country performance is driven mainly by differences in yield distributions between countries rather than by variations in data quality, sensor characteristics, or unaccounted environmental factors.
What would settle it
If matching the statistical distribution of yields between the training countries and the held-out country, while keeping all other factors fixed, failed to restore positive predictive performance, that would indicate distribution shift is not the primary cause. Restored performance would instead support the paper's attribution.
Original abstract
Accurate predictions of smallholder maize yields across national boundaries are critical for food security planning in sub-Saharan Africa, yet most published benchmarks report within-country performance that overstates true generalisability. This paper evaluates whether geospatial foundation model embeddings, specifically Prithvi-EO-1.0-100M and ViT-Base, outperform traditional Sentinel-2 spectral features under a Leave-One-Country-Out cross-validation scheme on 6,404 maize field observations from five African countries. The results show a clear generalisability gap: within-country random cross-validation yields moderate R² values, but all feature sets perform poorly under cross-country testing, with universally negative R². Frozen Prithvi-EO embeddings provide no meaningful advantage over engineered spectral features for cross-country prediction in this setting. The paper argues that the main limitation is a shift in yield distribution between countries rather than representation quality and releases a reproducible negative benchmark for future work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript evaluates whether embeddings from geospatial foundation models (Prithvi-EO-1.0-100M and ViT-Base) improve cross-country generalization for predicting smallholder maize yields in Sub-Saharan Africa compared to engineered Sentinel-2 spectral features. Using leave-one-country-out cross-validation on 6,404 field observations from five countries, it reports moderate within-country R² but universally negative R² in cross-country settings for all feature sets, concluding that frozen Prithvi-EO embeddings offer no advantage and that yield distribution shifts across countries are the main limitation rather than representation quality. A reproducible negative benchmark is released.
Significance. If substantiated, this work offers an important negative result for the application of foundation models in agricultural remote sensing, emphasizing the dominance of distribution shifts in limiting generalization. The provision of a reproducible benchmark is a positive contribution that can guide future research on handling cross-domain challenges in crop yield prediction.
major comments (2)
- [Abstract] The claim that 'the main limitation is a shift in yield distribution between countries rather than representation quality' lacks isolating evidence, as no per-country yield histograms, label variance statistics, cloud-cover metrics, or ablation holding yield range fixed while varying only the feature extractor are reported.
- [Evaluation setup] The LOCO performance collapse is shown for all feature sets, but without controls for country-specific data artifacts (label noise, sampling protocols, or sensor conditions), the causal attribution to yield shifts alone is under-supported, and that attribution is load-bearing for the interpretation that representation quality is not the issue.
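The per-country label statistics requested in the first comment are cheap to produce. A minimal sketch follows, assuming the data can be read as (country, yield) pairs; the function name `yield_stats_by_country` and the records are placeholders, not the paper's data.

```python
from collections import defaultdict
from statistics import mean, pvariance

def yield_stats_by_country(records):
    """Return {country: (n, mean_yield, population_variance)} from (country, yield) pairs."""
    grouped = defaultdict(list)
    for country, y in records:
        grouped[country].append(y)
    return {c: (len(v), mean(v), pvariance(v)) for c, v in grouped.items()}

# Invented records; country codes and yields (t/ha) are placeholders.
records = [("A", 1.0), ("A", 3.0), ("B", 4.0), ("B", 4.0), ("B", 7.0)]
stats = yield_stats_by_country(records)
for country, (n, mu, var) in sorted(stats.items()):
    print(f"{country}: n={n} mean={mu:.2f} var={var:.2f}")
```

Large differences in the per-country means relative to the within-country variances would be consistent with the yield-shift explanation, though they would not by themselves rule out the data artifacts raised in the second comment.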
minor comments (2)
- [Methods] Expand details on per-country field counts, exact Sentinel-2 band combinations for engineered features, and cloud-masking preprocessing to improve reproducibility of the negative benchmark.
- [Results] Report error bars or standard deviations on R² values across CV folds to better quantify the within-country vs. cross-country performance gap.
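The fold-level spread asked for in the [Results] comment could be summarised as below. The per-fold scores are placeholders chosen for the example and are not results from the paper; `summarise_folds` is a hypothetical helper name.

```python
from statistics import mean, stdev

def summarise_folds(scores):
    """Return (mean, sample standard deviation) of per-fold scores."""
    return mean(scores), stdev(scores)

# Placeholder per-fold R² values, one per held-out country (not real results).
fold_r2 = [-0.5, -1.5, -1.0, -2.0, -1.0]
avg, sd = summarise_folds(fold_r2)
print(f"R2 = {avg:.2f} +/- {sd:.2f} over {len(fold_r2)} folds")
```

With only five folds the standard deviation is itself a rough estimate, so reporting the individual per-country scores alongside it would likely be more informative.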
Simulated Author's Rebuttal
We thank the referee for the constructive feedback, which highlights opportunities to strengthen the evidential basis for our interpretation of the results. We address each major comment below and will incorporate revisions to improve clarity and support for the claims.
- Referee: [Abstract] The claim that 'the main limitation is a shift in yield distribution between countries rather than representation quality' lacks isolating evidence, as no per-country yield histograms, label variance statistics, cloud-cover metrics, or ablation holding yield range fixed while varying only the feature extractor are reported.
  Authors: We agree that additional visualizations and statistics would better isolate the role of yield distribution shifts. In the revised manuscript we will add per-country yield histograms, label variance statistics, and average cloud-cover metrics derived from the Sentinel-2 acquisitions. An ablation that strictly holds yield range fixed while varying only the feature extractor is difficult to implement without introducing selection bias or synthetic labels; we will instead add a dedicated limitations paragraph discussing this constraint and explaining why the uniform negative R² across all feature families (engineered spectral, Prithvi, ViT) still supports our conclusion that representation quality is not the primary bottleneck.
  Revision: partial
- Referee: [Evaluation setup] The LOCO performance collapse is shown for all feature sets, but without controls for country-specific data artifacts (label noise, sampling protocols, or sensor conditions), the causal attribution to yield shifts alone is under-supported and load-bearing for the interpretation that representation quality is not the issue.
  Authors: The LOCO protocol is chosen precisely because it reflects operational cross-border prediction; the fact that every representation family collapses similarly indicates the failure is not representation-specific. The underlying dataset follows a harmonized collection protocol across countries (detailed in Section 3), which reduces but does not eliminate possible artifacts. We will expand the evaluation-setup subsection to include available metadata on label-collection procedures and sensor conditions, and we will add an explicit discussion of residual country-specific factors. While we cannot retroactively introduce new controls that were not part of the original data collection, the consistent pattern across feature types remains the strongest available evidence that domain shift, rather than representation quality, drives the observed generalization gap.
  Revision: partial
Circularity Check
No circularity: purely empirical benchmark with direct CV comparisons
The paper reports an empirical leave-one-country-out evaluation of feature sets (Prithvi-EO embeddings vs. Sentinel-2 spectral features) on 6,404 real maize yield observations, producing standard performance metrics such as R² under within-country and cross-country splits. No equations, derivations, or parameter-fitting steps are described that could reduce a claimed prediction to a quantity defined by the inputs themselves. The central observation (negative cross-country R² for all methods) follows directly from applying off-the-shelf models to held-out country data; the attribution to yield distribution shift is presented as an interpretive hypothesis rather than a derived result. The evaluation is therefore self-contained against external data splits and does not rely on self-citation chains, ansatzes, or renamings that would trigger any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: Satellite imagery and derived embeddings contain information relevant to crop yield prediction.