
arxiv: 2605.08113 · v1 · submitted 2026-04-27 · 💻 cs.LG · cs.CV

Do Foundation Model Embeddings Improve Cross-Country Crop Yield Generalisation? A Leave-One-Country-Out Evaluation in Sub-Saharan Africa

Pith reviewed 2026-05-12 01:23 UTC · model grok-4.3

classification 💻 cs.LG cs.CV
keywords maize yield prediction · foundation models · remote sensing · cross-country generalization · sub-Saharan Africa · Sentinel-2 · Prithvi-EO · distribution shift

The pith

Frozen embeddings from geospatial foundation models provide no advantage over traditional spectral features for predicting maize yields across different countries.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether embeddings from large geospatial foundation models can generalise smallholder maize yield predictions from one African country to others better than basic satellite-derived spectral features. It collects 6,404 field observations across five sub-Saharan countries and applies a leave-one-country-out scheme that trains on four countries and tests on the fifth to simulate real cross-border use. All feature sets, including the foundation model embeddings, produce negative R-squared values under this protocol even though random within-country splits show moderate performance. The authors identify shifts in the distribution of actual yield values between countries as the dominant barrier rather than any shortfall in how the embeddings represent the imagery.

Core claim

Under leave-one-country-out testing on 6,404 observations from five countries, every tested feature set yields negative R-squared, whereas within-country random cross-validation produces moderate R-squared values. Frozen Prithvi-EO and ViT-Base embeddings show no meaningful gain over engineered Sentinel-2 spectral features. The paper attributes the performance collapse primarily to shifts in yield distributions across countries rather than to representation quality.
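
A small worked sketch may help show why a shift in yield levels alone can push R-squared below zero. It uses the standard definition, R² = 1 - SS_res / SS_tot, with SS_tot measured around the mean of the held-out country's yields. The `r_squared` helper and all yield values are made up for illustration and are not taken from the paper.

```python
def r_squared(y_true, y_pred):
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


# Hypothetical yields in kg/ha (illustrative numbers, not from the paper).
train_yields = [1800.0, 2000.0, 2200.0]  # pooled training countries
test_yields = [800.0, 1000.0, 1200.0]    # held-out country, lower yield level

# A model that only learned the training-country yield level.
train_mean = sum(train_yields) / len(train_yields)
mean_preds = [train_mean] * len(test_yields)
loco_r2 = r_squared(test_yields, mean_preds)

# Predictions that track field-to-field differences perfectly but sit
# 1000 kg/ha too high still score far below zero.
offset_preds = [y + 1000.0 for y in test_yields]
offset_r2 = r_squared(test_yields, offset_preds)

print(loco_r2, offset_r2)  # -37.5 -36.5
```

In this toy case the second predictor orders the fields perfectly, yet its R-squared is almost as bad as the constant predictor's, which is the pattern one would expect if level shift rather than feature quality dominates the error.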

What carries the argument

The leave-one-country-out cross-validation protocol that withholds all observations from one country for testing while training on the remaining four countries, applied to compare frozen foundation model embeddings against hand-engineered spectral features from Sentinel-2 imagery.
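
The protocol described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the `loco_splits` helper and the placeholder country labels are assumptions for the example, not the paper's code.

```python
from collections import defaultdict


def loco_splits(countries):
    """Yield (held_out, train_idx, test_idx) once per distinct country.

    `countries` holds one country label per observation.
    """
    by_country = defaultdict(list)
    for idx, country in enumerate(countries):
        by_country[country].append(idx)

    for held_out in sorted(by_country):
        test_idx = by_country[held_out]
        # Training data is every observation from the other countries.
        train_idx = sorted(
            i for c, idxs in by_country.items() if c != held_out for i in idxs
        )
        yield held_out, train_idx, test_idx


# Placeholder labels "A".."E" stand in for the five countries.
countries = ["A", "B", "C", "D", "E", "A", "B", "C", "D", "E"]
folds = list(loco_splits(countries))
for held_out, train_idx, test_idx in folds:
    print(held_out, "train:", len(train_idx), "test:", len(test_idx))
```

With scikit-learn available, `sklearn.model_selection.LeaveOneGroupOut` produces the same kind of split when the country label is passed as `groups`.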

If this is right

  • Within-country random splits substantially overestimate the generalisability of yield prediction models to new countries.
  • Engineered spectral features from Sentinel-2 perform at least as well as frozen Prithvi-EO embeddings under strict cross-country conditions.
  • All current representation approaches fail to produce useful predictions when yield levels differ markedly between training and test countries.
  • Releasing the negative benchmark enables future work to measure progress against a reproducible cross-country baseline.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Explicit correction for country-level yield shifts, such as mean adjustment or domain adaptation layers, could be tested as a direct follow-up to isolate the effect.
  • The same leave-one-country-out design applied to other regions or crops would reveal whether distribution shifts are a general obstacle in agricultural remote sensing.
  • Fine-tuning the foundation model weights instead of keeping them frozen might alter the relative performance of embeddings versus spectral features.
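
The mean-adjustment idea in the first bullet could be prototyped roughly as follows. The sketch assumes a small labelled calibration sample from the held-out country is available, which the paper's leave-one-country-out protocol does not provide; the helper names and all numbers are hypothetical.

```python
def mean_offset(calib_true, calib_pred):
    """Offset that aligns the mean prediction with the mean observed yield."""
    mean_true = sum(calib_true) / len(calib_true)
    mean_pred = sum(calib_pred) / len(calib_pred)
    return mean_true - mean_pred


def apply_offset(preds, offset):
    """Shift every prediction by the same country-level offset."""
    return [p + offset for p in preds]


# Hypothetical numbers in kg/ha, for illustration only.
calib_true = [950.0, 1050.0]   # a few labelled fields in the held-out country
calib_pred = [1950.0, 2050.0]  # model predictions for those same fields
offset = mean_offset(calib_true, calib_pred)

test_pred = [1900.0, 2000.0, 2100.0]
adjusted = apply_offset(test_pred, offset)
print(offset, adjusted)  # -1000.0 [900.0, 1000.0, 1100.0]
```

Comparing R-squared before and after such a shift would separate the share of the error due to the yield level from the share due to field-to-field ranking.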

Load-bearing premise

The assumption that the observed collapse in cross-country performance is driven mainly by differences in yield distributions between countries rather than by variations in data quality, sensor characteristics, or unaccounted environmental factors.

What would settle it

If matching the statistical distribution of yields between the training countries and the held-out country, with all other factors held fixed, failed to restore positive predictive performance, that would indicate distribution shift is not the primary cause.

Figures

Figures reproduced from arXiv: 2605.08113 by Yaw Osei Adjei.

Figure 1. Within-country (random CV) vs. cross-country (LOCO)
Figure 3. Per-country RMSE for the best model per feature set
Figure 4. Generalisation gap (random CV R² minus LOCO R²) per feature–regressor combination. Ridge consistently shows smaller gaps than tree ensembles despite lower within-country accuracy.
… provide some signal beyond the simplest possible predictor. However, both learned models and the naive baseline lie below zero R², confirming that predicting held-out countries from other-country data is a fundamentally difficu…
Figure 5. Per-country LOCO RMSE (kg/ha) for all nine conditions,
Figure 6. Predicted vs. actual yield scatter under LOCO for the Prithvi-EO / Ridge condition (one panel per held-out country).
Figure 9. Mean LOCO R² ± one standard deviation across the five LOCO country folds, by feature set and regressor. Per-fold standard deviations (0.32–0.70) dwarf the between-condition differences (<0.07), indicating that no condition is statistically separable from any other.
[Plot: "Feature Ablation: NDVI-Only vs Spectral vs…"; y-axis LOCO R²; categories NDVI Only, Spectral (10-band), Prithvi-EO (768-dim), ViT-Base (768-dim)]
Figure 10. Feature ablation under LOCO: a single NDVI feature
Original abstract

Accurate predictions of smallholder maize yields across national boundaries are critical for food security planning in sub-Saharan Africa, yet most published benchmarks report within-country performance that overstates true generalisability. This paper evaluates whether geospatial foundation model embeddings, specifically Prithvi-EO-1.0-100M and ViT-Base, outperform traditional Sentinel-2 spectral features under a Leave-One-Country-Out cross-validation scheme on 6,404 maize field observations from five African countries. The results show a clear generalisability gap: within-country random cross-validation yields moderate R^2 values, but all feature sets perform poorly under cross-country testing, with universally negative R^2. Frozen Prithvi-EO embeddings provide no meaningful advantage over engineered spectral features for cross-country prediction in this setting. The paper argues that the main limitation is a shift in yield distribution between countries rather than representation quality and releases a reproducible negative benchmark for future work.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates whether embeddings from geospatial foundation models (Prithvi-EO-1.0-100M and ViT-Base) improve cross-country generalization for predicting smallholder maize yields in Sub-Saharan Africa compared to engineered Sentinel-2 spectral features. Using leave-one-country-out cross-validation on 6,404 field observations from five countries, it reports moderate within-country R² but universally negative R² in cross-country settings for all feature sets, concluding that frozen Prithvi-EO embeddings offer no advantage and that yield distribution shifts across countries are the main limitation rather than representation quality. A reproducible negative benchmark is released.

Significance. If substantiated, this work offers an important negative result for the application of foundation models in agricultural remote sensing, emphasizing the dominance of distribution shifts in limiting generalization. The provision of a reproducible benchmark is a positive contribution that can guide future research on handling cross-domain challenges in crop yield prediction.

major comments (2)
  1. [Abstract] Abstract: The claim that 'the main limitation is a shift in yield distribution between countries rather than representation quality' lacks isolating evidence, as no per-country yield histograms, label variance statistics, cloud-cover metrics, or ablation holding yield range fixed while varying only the feature extractor are reported.
  2. [Evaluation setup] Evaluation setup: The LOCO performance collapse is shown for all feature sets, but without controls for country-specific data artifacts (label noise, sampling protocols, or sensor conditions), the causal attribution to yield shifts alone is under-supported and load-bearing for the interpretation that representation quality is not the issue.
minor comments (2)
  1. [Methods] Methods: Expand details on per-country field counts, exact Sentinel-2 band combinations for engineered features, and cloud-masking preprocessing to improve reproducibility of the negative benchmark.
  2. [Results] Results: Report error bars or standard deviations on R² values across CV folds to better quantify the within-country vs. cross-country performance gap.
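
The fold-level reporting requested in the second minor comment could look roughly like this. The per-fold scores below are invented for illustration and do not come from the paper; the sketch uses the sample standard deviation, which may differ from whatever convention the authors adopt.

```python
import statistics


def summarise_folds(fold_scores):
    """Return (mean, sample standard deviation) of per-fold R² scores."""
    return statistics.mean(fold_scores), statistics.stdev(fold_scores)


# Hypothetical per-fold LOCO R² values for two conditions (not from the paper).
scores = {
    "spectral/ridge": [-0.1, -0.5, -0.2, -0.9, -0.3],
    "prithvi/ridge": [-0.2, -0.4, -0.3, -0.8, -0.4],
}

summary = {name: summarise_folds(vals) for name, vals in scores.items()}
for name, (mean, sd) in summary.items():
    print(f"{name}: {mean:.2f} ± {sd:.2f}")

# Compare the difference between conditions with the fold-to-fold spread.
gap = abs(summary["spectral/ridge"][0] - summary["prithvi/ridge"][0])
smallest_sd = min(sd for _, sd in summary.values())
print("gap smaller than fold spread:", gap < smallest_sd)
```

When the gap between conditions is much smaller than the spread across folds, as in this toy data, the conditions cannot be ranked with any confidence from five folds alone.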

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback, which highlights opportunities to strengthen the evidential basis for our interpretation of the results. We address each major comment below and will incorporate revisions to improve clarity and support for the claims.

  1. Referee: [Abstract] Abstract: The claim that 'the main limitation is a shift in yield distribution between countries rather than representation quality' lacks isolating evidence, as no per-country yield histograms, label variance statistics, cloud-cover metrics, or ablation holding yield range fixed while varying only the feature extractor are reported.

    Authors: We agree that additional visualizations and statistics would better isolate the role of yield distribution shifts. In the revised manuscript we will add per-country yield histograms, label variance statistics, and average cloud-cover metrics derived from the Sentinel-2 acquisitions. An ablation that strictly holds yield range fixed while varying only the feature extractor is difficult to implement without introducing selection bias or synthetic labels; we will instead add a dedicated limitations paragraph discussing this constraint and explaining why the uniform negative R² across all feature families (engineered spectral, Prithvi, ViT) still supports our conclusion that representation quality is not the primary bottleneck. revision: partial

  2. Referee: [Evaluation setup] Evaluation setup: The LOCO performance collapse is shown for all feature sets, but without controls for country-specific data artifacts (label noise, sampling protocols, or sensor conditions), the causal attribution to yield shifts alone is under-supported and load-bearing for the interpretation that representation quality is not the issue.

    Authors: The LOCO protocol is chosen precisely because it reflects operational cross-border prediction; the fact that every representation family collapses similarly indicates the failure is not representation-specific. The underlying dataset follows a harmonized collection protocol across countries (detailed in Section 3), which reduces but does not eliminate possible artifacts. We will expand the evaluation-setup subsection to include available metadata on label-collection procedures and sensor conditions, and we will add an explicit discussion of residual country-specific factors. While we cannot retroactively introduce new controls that were not part of the original data collection, the consistent pattern across feature types remains the strongest available evidence that domain shift, rather than representation quality, drives the observed generalization gap. revision: partial
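
The per-country label statistics promised in the first response could be computed along these lines. This is a sketch under stated assumptions: the record layout, the `yield_stats_by_country` helper, and the sample values are hypothetical, and population variance is used only for simplicity.

```python
import statistics
from collections import defaultdict


def yield_stats_by_country(records):
    """Group (country, yield) pairs and return count, mean and variance."""
    grouped = defaultdict(list)
    for country, yield_kg_ha in records:
        grouped[country].append(yield_kg_ha)
    return {
        country: {
            "n": len(values),
            "mean": statistics.mean(values),
            "variance": statistics.pvariance(values),
        }
        for country, values in grouped.items()
    }


# Hypothetical records: (country label, yield in kg/ha).
records = [("A", 800.0), ("A", 1200.0), ("B", 2500.0), ("B", 3500.0)]
stats = yield_stats_by_country(records)
for country, row in stats.items():
    print(country, row)
```

Large differences in the per-country means or variances would be direct evidence for the label shift the authors invoke, and the grouped values can also feed the histograms mentioned in the same response.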

Circularity Check

0 steps flagged

No circularity: purely empirical benchmark with direct CV comparisons


The paper reports an empirical leave-one-country-out evaluation of feature sets (Prithvi-EO embeddings vs. Sentinel-2 spectral features) on 6,404 real maize yield observations, producing standard performance metrics such as R² under within-country and cross-country splits. No equations, derivations, or parameter-fitting steps are described that could reduce a claimed prediction to a quantity defined by the inputs themselves. The central observation (negative cross-country R² for all methods) follows directly from applying off-the-shelf models to held-out country data; the attribution to yield distribution shift is presented as an interpretive hypothesis rather than a derived result. The evaluation is therefore self-contained against external data splits and does not rely on self-citation chains, ansatzes, or renamings that would trigger any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the assumption that satellite-derived features can be used to predict yields and that the 6,404 observations adequately sample the cross-country distribution shifts.

axioms (1)
  • domain assumption Satellite imagery and derived embeddings contain information relevant to crop yield prediction
    Invoked when using spectral features and foundation model embeddings as inputs for yield regression.


