arxiv: 2604.23166 · v1 · submitted 2026-04-25 · 💻 cs.CY · cs.CV


A satellite foundation model for improved wealth monitoring

Zhuo Zheng, Iván Higuera-Mendieta, Richard Lee, David Newhouse, Talip Kilic, Stefano Ermon, Marshall Burke, David B. Lobell


Pith reviewed 2026-05-08 07:12 UTC · model grok-4.3


keywords satellite imagery, wealth prediction, foundation model, self-supervised learning, poverty monitoring, Landsat, temporal analysis, economic development


The pith

A self-supervised satellite model predicts wealth levels and tracks changes over decades using sparse labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Tempov, a foundation model pretrained by self-supervision on three million bi-temporal Landsat image pairs. The pretrained features are then adapted through parameter-efficient fine-tuning to sparse household survey labels for predicting local wealth. The resulting system produces high-resolution wealth maps, supports zero-shot nowcasting and hindcasting up to a decade away, and tracks decadal changes while outperforming prior neural network and geospatial baselines. It also maintains competitive accuracy with only 10 percent of the usual survey samples and scales to a continent-wide Africa model with R squared of 0.63. A sympathetic reader would care because this approach reduces dependence on costly, infrequent surveys and enables timely monitoring of economic conditions in data-scarce regions.

Core claim

Tempov is a satellite foundation model pretrained by self-supervision on three million bi-temporal Landsat pairs and adapted with parameter-efficient fine-tuning to sparse survey labels. It enables large-scale, high-resolution wealth mapping and dynamic measurement, including zero-shot nowcasting up to a decade after observed labels, retrospective hindcasting, and decadal change tracking, while outperforming existing neural network and geospatial foundation-model baselines. In low-label regimes it achieves competitive accuracy with only 10 percent of survey samples, generalizes across countries, and produces a unified Africa-wide model with R squared of 0.63 and r squared of 0.68 from which

What carries the argument

Tempov, the self-supervised foundation model pretrained on bi-temporal Landsat pairs, which learns temporally robust features that transfer to wealth prediction when fine-tuned on limited labels.
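The page says only that adaptation uses parameter-efficient fine-tuning; it does not name the method. As a hedged illustration of the general idea, the sketch below uses a LoRA-style low-rank update (LoRA appears in the paper's reference list, but its use here is an assumption). All names (`lora_effective_weight`, `trainable_fraction`) and the toy dimensions are hypothetical.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product of a (n x k) and b (k x m)."""
    return [
        [sum(a[i][k] * b[k][j] for k in range(len(b))) for j in range(len(b[0]))]
        for i in range(len(a))
    ]


def lora_effective_weight(w: Matrix, a: Matrix, b: Matrix, alpha: float) -> Matrix:
    """Return W + (alpha / rank) * B @ A, with W frozen and only A, B trainable.

    Shapes: w is d_out x d_in, a is rank x d_in, b is d_out x rank.
    """
    rank = len(a)
    delta = matmul(b, a)
    scale = alpha / rank
    return [
        [w[i][j] + scale * delta[i][j] for j in range(len(w[0]))]
        for i in range(len(w))
    ]


def trainable_fraction(d_out: int, d_in: int, rank: int) -> float:
    """Share of parameters trained by the adapter relative to the full matrix."""
    return rank * (d_in + d_out) / (d_in * d_out)


# Toy example (assumed values): a 2 x 3 frozen weight with a rank-1 adapter.
w = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
a = [[1.0, 2.0, 3.0]]
b_zero = [[0.0], [0.0]]   # zero-initialised B leaves the pretrained weight unchanged
b_tuned = [[1.0], [0.0]]  # after some hypothetical training step

w_initial = lora_effective_weight(w, a, b_zero, alpha=1.0)
w_adapted = lora_effective_weight(w, a, b_tuned, alpha=1.0)
fraction_768_r8 = trainable_fraction(768, 768, 8)
```

The point of the sketch is the parameter count: for a 768 x 768 layer with rank 8, the adapter trains roughly 2 percent of the weights, which is one reason such methods suit sparse survey labels.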

If this is right

Supports zero-shot nowcasting of wealth up to a decade after the last training labels.
Enables retrospective hindcasting to estimate wealth in earlier periods without contemporary surveys.
Allows high-resolution tracking of wealth changes over a decade at continent scale.
Maintains competitive accuracy using only 10 percent of typical survey samples.
Generalizes to unified models for populous countries both inside and outside Africa.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The decadal maps could highlight intra-country economic divergence that national statistics obscure.
Similar pretraining on other satellite sources might extend the approach to additional development indicators.
Policy teams could combine the outputs with intervention data to evaluate local program impacts over time.
Lower label requirements might make repeated monitoring feasible in more countries than current surveys allow.

Load-bearing premise

That features learned from self-supervised pretraining on bi-temporal Landsat pairs remain predictive of wealth despite temporal distribution shifts, cross-country differences, and sparse labels without major interference from clouds, sensor changes, or economic shocks.

What would settle it

Apply the model to predict wealth in a held-out recent survey dataset from a new country or decade and check whether the correlation with ground-truth labels falls substantially below the reported R squared of 0.63.

Original abstract

Poverty statistics guide social policy, but in many low- and middle-income countries, censuses and household surveys that collect these data are costly, infrequent, quickly outdated, and sometimes error-prone. Satellite imagery offers global coverage and the possibility of predicting economic livelihoods at scale, yet existing approaches to predicting livelihoods with imagery or other non-traditional data often fail to reliably identify local-level variation and, as we show, degrade under temporal shift. Here we introduce Tempov, a satellite foundation model pretrained by self-supervision on three million bi-temporal Landsat pairs and adapted with parameter-efficient fine-tuning to sparse survey labels. The model enables large-scale, high-resolution wealth mapping and dynamic measurement, including zero-shot nowcasting up to a decade after observed labels, retrospective hindcasting, and decadal change tracking, while outperforming existing neural network and geospatial foundation-model baselines. In low-label regimes, Tempov achieves competitive accuracy with only 10% of survey samples, indicating substantially reduced dependence on expensive label collection. The model further generalizes across populous countries within and outside Africa, and scales to a unified Africa-wide model with strong continent-level performance ($R^2=0.63$, $r^2=0.68$), from which we generate high-resolution decadal maps of wealth and wealth changes for the African continent. Analysis of these maps shows large variation in recent economic performance both within and across countries. Our open-source approach provides a pathway to timely, scalable, low-cost monitoring of wealth and poverty from routinely collected satellite data.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Tempov is a new Landsat foundation model that claims solid zero-shot temporal generalization for wealth prediction, but the validation details on time shifts are too thin to judge how much of that is real.


The main thing to know is that this paper introduces Tempov, pretrained self-supervised on three million bi-temporal Landsat pairs, then adapted to sparse survey labels for wealth mapping. It reports an Africa-wide R^2 of 0.63, beats some baselines, works with just 10% of the labels, and produces decadal change maps. The zero-shot nowcasting and hindcasting up to a decade out is the part that stands out if it checks out.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces Tempov, a satellite foundation model pretrained via self-supervision on three million bi-temporal Landsat image pairs. After parameter-efficient fine-tuning on sparse household survey labels for wealth prediction, the model is claimed to support large-scale high-resolution wealth mapping, zero-shot nowcasting and hindcasting up to a decade apart, and decadal change tracking. It reports outperformance over neural network and geospatial foundation-model baselines, competitive accuracy with only 10% of survey samples, and strong continent-level results for an Africa-wide model (R²=0.63, r²=0.68), from which decadal wealth maps are generated. The approach is presented as reducing dependence on costly label collection while generalizing across countries.

Significance. If the temporal generalization and low-label claims hold under rigorous validation, the work could meaningfully advance scalable, low-cost socioeconomic monitoring in data-scarce regions by leveraging routinely collected satellite imagery. The open-source release and generated Africa-wide maps constitute concrete practical contributions that could support further research and policy use.

major comments (3)

[Abstract] Abstract: The central claims of zero-shot nowcasting/hindcasting up to a decade and decadal change tracking rest on the assumption that self-supervised features from bi-temporal pairs remain predictive under temporal shifts, yet no details are given on the temporal gaps between training labels and evaluation periods, the use of strict temporal hold-outs (train labels ≤ T, test labels ≥ T+Δ), or quantification of distribution shifts such as sensor changes or cloud cover.
[Results] Results on low-label regimes: The statement that Tempov achieves competitive accuracy with only 10% of survey samples is load-bearing for the reduced-label-dependence claim, but the manuscript does not specify whether the 10% subsets are randomly sampled, temporally stratified, or spatially clustered, nor whether error bars from multiple runs or ablation on label selection are reported.
[Africa-wide Model] Africa-wide model evaluation: The reported R²=0.63 and r²=0.68 for the unified continent-scale model would be strengthened by explicit reporting of cross-country generalization protocols (e.g., country-level hold-outs) versus within-country spatial splits, as spatial autocorrelation could otherwise inflate performance metrics.
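The strict temporal hold-out named in the first major comment (train labels ≤ T, test labels ≥ T+Δ) can be sketched as below. This is a minimal illustration, not the paper's protocol: the record layout, the function name `temporal_holdout`, and the example years are all assumptions.

```python
from typing import Dict, List, Tuple

Record = Dict[str, float]  # hypothetical layout: {"year": ..., "wealth": ...}


def temporal_holdout(
    records: List[Record], cutoff_year: int, gap_years: int
) -> Tuple[List[Record], List[Record]]:
    """Split so that train labels are <= T and test labels are >= T + delta.

    Requiring gap_years >= 1 guarantees the two sets cannot share a year.
    """
    if gap_years < 1:
        raise ValueError("gap_years must be at least 1 for a strict hold-out")
    train = [r for r in records if r["year"] <= cutoff_year]
    test = [r for r in records if r["year"] >= cutoff_year + gap_years]
    return train, test


# Toy survey clusters (assumed values) spanning 2005 to 2020.
records = [
    {"year": y, "wealth": 0.1 * i}
    for i, y in enumerate([2005, 2008, 2010, 2013, 2018, 2020])
]
train, test = temporal_holdout(records, cutoff_year=2010, gap_years=10)
train_years = [r["year"] for r in train]
test_years = [r["year"] for r in test]
```

Records that fall strictly between T and T+Δ (2013 and 2018 in the toy data) are dropped from both sets, which is what makes the gap a genuine decade-long shift rather than an adjacent-year test.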

minor comments (2)

[Abstract] The distinction between the reported R² and r² metrics for the Africa-wide model should be defined explicitly in the text or a table caption.
[Methods] A brief description of the exact self-supervised pretext task (e.g., contrastive or reconstruction objective on bi-temporal pairs) would improve reproducibility.
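On the first minor comment: the page does not define the two metrics. One common convention, assumed here rather than confirmed by the paper, is that R² is the coefficient of determination and r² is the squared Pearson correlation. The sketch below shows how the two can diverge; function names and data are illustrative.

```python
import math
from typing import Sequence


def r2_determination(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    mean_true = sum(y_true) / len(y_true)
    ss_res = sum((t - p) ** 2 for t, p in zip(y_true, y_pred))
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    return 1.0 - ss_res / ss_tot


def pearson_r_squared(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Squared Pearson correlation between labels and predictions."""
    mean_true = sum(y_true) / len(y_true)
    mean_pred = sum(y_pred) / len(y_pred)
    cov = sum((t - mean_true) * (p - mean_pred) for t, p in zip(y_true, y_pred))
    var_true = sum((t - mean_true) ** 2 for t in y_true)
    var_pred = sum((p - mean_pred) ** 2 for p in y_pred)
    return (cov / math.sqrt(var_true * var_pred)) ** 2


# Predictions that rank every location correctly but carry a constant bias.
y_true = [1.0, 2.0, 3.0, 4.0]
y_pred = [2.0, 3.0, 4.0, 5.0]
big_r2 = r2_determination(y_true, y_pred)
small_r2 = pearson_r_squared(y_true, y_pred)
```

Under this convention r² ignores systematic offset and scale errors while R² penalizes them, so a gap such as 0.68 versus 0.63 would be read as a modest calibration error on top of a correct ranking.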

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. The comments have helped us improve the clarity and rigor of our presentation, particularly regarding validation protocols. We provide point-by-point responses below and have revised the manuscript accordingly.


Referee: [Abstract] Abstract: The central claims of zero-shot nowcasting/hindcasting up to a decade and decadal change tracking rest on the assumption that self-supervised features from bi-temporal pairs remain predictive under temporal shifts, yet no details are given on the temporal gaps between training labels and evaluation periods, the use of strict temporal hold-outs (train labels ≤ T, test labels ≥ T+Δ), or quantification of distribution shifts such as sensor changes or cloud cover.

Authors: We appreciate the referee's emphasis on rigorous temporal validation. The original submission indeed omitted some specifics on these protocols. In the revised manuscript, we have added a dedicated subsection in the Methods describing our temporal splitting strategy, including explicit temporal gaps of up to 10 years between pretraining/fine-tuning periods and evaluation. We confirm the use of strict temporal hold-outs (train on labels ≤ T, evaluate on ≥ T+Δ) and include quantitative analysis of distribution shifts, such as those arising from Landsat sensor changes (e.g., from TM to OLI) and variations in cloud cover, which our bi-temporal self-supervision helps mitigate. These details now support the zero-shot temporal generalization claims. revision: yes
Referee: [Results] Results on low-label regimes: The statement that Tempov achieves competitive accuracy with only 10% of survey samples is load-bearing for the reduced-label-dependence claim, but the manuscript does not specify whether the 10% subsets are randomly sampled, temporally stratified, or spatially clustered, nor whether error bars from multiple runs or ablation on label selection are reported.

Authors: We agree that additional details on the low-label regime experiments are warranted to fully substantiate the claims. The 10% subsets were randomly sampled from the full survey dataset, and performance is reported as averages with standard deviations over multiple (5) independent sampling runs to provide error bars. We have now included these details in the main text and added an ablation study in the supplementary information comparing random sampling to temporally stratified and spatially clustered selections. The results show that Tempov's advantage holds across sampling methods, with error bars confirming statistical reliability. revision: yes
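The low-label protocol described in this response (random 10% subsets, five independent runs, mean and standard deviation) can be sketched as follows. The `evaluate` callback is a hypothetical stand-in for fine-tuning on the subset and scoring on held-out data; the toy scorer here simply averages the labels so the example runs on its own.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple


def low_label_runs(
    labels: Sequence[float],
    fraction: float,
    n_runs: int,
    evaluate: Callable[[List[float]], float],
    seed: int = 0,
) -> Tuple[float, float, List[float]]:
    """Draw n_runs random subsets of the labels and summarise the scores."""
    rng = random.Random(seed)  # seeded so the subsets are reproducible
    k = max(1, round(fraction * len(labels)))
    scores = []
    for _ in range(n_runs):
        subset = rng.sample(list(labels), k)  # sampling without replacement
        scores.append(evaluate(subset))
    return statistics.mean(scores), statistics.stdev(scores), scores


def mean_label_score(subset: List[float]) -> float:
    """Placeholder scorer (assumption): not a real model evaluation."""
    return sum(subset) / len(subset)


survey_labels = [float(i) for i in range(100)]
mean_score, std_score, run_scores = low_label_runs(
    survey_labels, fraction=0.1, n_runs=5, evaluate=mean_label_score
)
repeat = low_label_runs(survey_labels, fraction=0.1, n_runs=5, evaluate=mean_label_score)
```

Swapping `rng.sample` for a temporally stratified or spatially clustered selector would give the ablation the response mentions, with the rest of the loop unchanged.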
Referee: [Africa-wide Model] Africa-wide model evaluation: The reported R²=0.63 and r²=0.68 for the unified continent-scale model would be strengthened by explicit reporting of cross-country generalization protocols (e.g., country-level hold-outs) versus within-country spatial splits, as spatial autocorrelation could otherwise inflate performance metrics.

Authors: This comment correctly identifies a potential issue with spatial autocorrelation in geospatial models. To address it, the revised manuscript now explicitly describes the evaluation protocol for the Africa-wide model: we perform both within-country spatial cross-validation and country-level hold-out experiments, where data from entire countries are withheld from training. We report the metrics separately, with the continent-level performance (R²=0.63, r²=0.68) holding under the stricter country hold-outs. This demonstrates that the results are not inflated by within-country spatial correlations and supports the generalization claims. revision: yes
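A country-level hold-out of the kind described in this response can be sketched as a leave-one-country-out generator. The record fields, country codes, and function name are assumptions for illustration; the paper's actual folds are not given on this page.

```python
from typing import Dict, Iterator, List, Tuple

Record = Dict[str, object]  # hypothetical layout: {"country": ..., "wealth": ...}


def leave_country_out(
    records: List[Record],
) -> Iterator[Tuple[str, List[Record], List[Record]]]:
    """Yield (held_out_country, train, test) with the whole country withheld."""
    countries = sorted({str(r["country"]) for r in records})
    for held_out in countries:
        train = [r for r in records if r["country"] != held_out]
        test = [r for r in records if r["country"] == held_out]
        yield held_out, train, test


# Toy clusters (assumed values) from three countries.
records = [
    {"country": "KE", "wealth": 0.2},
    {"country": "KE", "wealth": 0.4},
    {"country": "NG", "wealth": 0.1},
    {"country": "TZ", "wealth": 0.3},
]
folds = list(leave_country_out(records))
held_out_names = [name for name, _, _ in folds]
```

Because no cluster from the test country is seen in training, nearby-cluster similarity inside a country cannot leak into the score, which is the autocorrelation concern the referee raises.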

Circularity Check

0 steps flagged

No significant circularity in derivation chain


The paper's core pipeline—self-supervised pretraining on three million unlabeled bi-temporal Landsat pairs, followed by parameter-efficient fine-tuning on independent sparse survey labels and evaluation on held-out data—contains no self-definitional steps, no fitted inputs renamed as predictions, and no load-bearing self-citations that reduce the central claims to tautology. Claims of R^2=0.63, zero-shot temporal generalization, and low-label competitiveness are presented as empirical outcomes from held-out testing rather than quantities forced by construction or prior author results. The derivation remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the transferability of self-supervised features learned from bi-temporal satellite pairs to wealth prediction under temporal and spatial shifts; this is a domain assumption rather than a derived result.

axioms (1)

Domain assumption: Self-supervised pretraining on bi-temporal Landsat pairs learns features that remain predictive of wealth after temporal shifts when fine-tuned on sparse labels.
This assumption underpins the zero-shot nowcasting and hindcasting claims in the abstract.

pith-pipeline@v0.9.0 · 5599 in / 1464 out tokens · 51909 ms · 2026-05-08T07:12:28.194379+00:00 · methodology


Reference graph

Works this paper leans on

57 extracted references · 13 canonical work pages · 7 internal anchors

[1] Devarajan, S. Africa’s statistical tragedy. Review of Income and Wealth 59, S9–S15 (2013).
[2] Seidler, V. et al. Subnational variations in the quality of household survey data in sub-Saharan Africa. Nature Communications 16, 3771 (2025).
[3] Burke, M., Driscoll, A., Lobell, D. B. & Ermon, S. Using satellite imagery to understand and promote sustainable development. Science 371, eabe8628 (2021).
[4] Yeh, C. et al. Using publicly available satellite imagery and deep learning to understand economic well-being in Africa. Nature Communications 11, 2583 (2020).
[5] Chi, G., Fang, H., Chatterjee, S. & Blumenstock, J. E. Microestimates of wealth for all low- and middle-income countries. Proceedings of the National Academy of Sciences 119, e2113658119 (2022).
[6] A post-2030 vision. Nature Sustainability 8, 849–850 (2025).
[7] Jean, N. et al. Combining satellite imagery and machine learning to predict poverty. Science 353, 790–794 (2016).
[8] Elbers, C., Lanjouw, J. O. & Lanjouw, P. Micro-level estimation of poverty and inequality. Econometrica 71, 355–364 (2003).
[9] Tarozzi, A. Can census data alone signal heterogeneity in the estimation of poverty maps? Journal of Development Economics 95, 170–185 (2011).
[10] Engstrom, R., Hersh, J. & Newhouse, D. Poverty from space: Using high resolution satellite imagery for estimating economic well-being. The World Bank Economic Review 36, 382–412 (2022).
[11] Henderson, J. V., Storeygard, A. & Weil, D. N. Measuring economic growth from outer space. American Economic Review 102, 994–1028 (2012).
[12] Chen, X. & Nordhaus, W. D. Using luminosity data as a proxy for economic statistics. Proceedings of the National Academy of Sciences 108, 8589–8594 (2011).
[13] Babenko, B., Hersh, J., Newhouse, D., Ramakrishnan, A. & Swartz, T. Poverty mapping using convolutional neural networks trained on high and medium resolution satellite images, with an application in Mexico. arXiv preprint arXiv:1711.06323 (2017).
[14] Chen, T. XGBoost: A scalable tree boosting system. arXiv preprint arXiv:1603.02754 (2016).
[15] Blumenstock, J., Cadamuro, G. & On, R. Predicting poverty and wealth from mobile phone metadata. Science 350, 1073–1076 (2015).
[16] Newhouse, D. Small area estimation of poverty and wealth using geospatial data: What have we learned so far? Calcutta Statistical Association Bulletin 76, 7–32 (2024).
[17] Ayush, K., Uzkent, B., Burke, M., Lobell, D. & Ermon, S. Generating interpretable poverty maps using object detection in satellite images. arXiv preprint arXiv:2002.01612 (2020).
[18] Pettersson, M. B., Kakooei, M., Ortheden, J., Johansson, F. D. & Daoud, A. Time series of satellite imagery improve deep learning estimates of neighborhood-level poverty in Africa. IJCAI 6165–6173 (2023).
[19] Zheng, Z. et al. Dynamic, high-resolution poverty measurement in data-scarce environments. Journal of Development Economics 103691 (2025).
[20] Newhouse, D., Ramakrishnan, A., Swartz, T., Merfeld, J. & Lahiri, P. Small area estimation of monetary poverty in Mexico using satellite imagery and machine learning. Oxford Bulletin of Economics and Statistics 87, 1158–1172 (2025).
[21] Dosovitskiy, A. et al. An image is worth 16x16 words: Transformers for image recognition at scale. International Conference on Learning Representations (2021).
[22] Tucker, C. J. Red and photographic infrared linear combinations for monitoring vegetation. Remote Sensing of Environment 8, 127–150 (1979).
[23] Small, C. The Landsat ETM+ spectral mixing space. Remote Sensing of Environment 93, 1–17 (2004).
[24] Brown, M. E. & Funk, C. C. Food security under climate change. Science 319, 580–581 (2008).
[25] Lobell, D. B. The use of satellite data for crop yield gap analysis. Field Crops Research 143, 56–64 (2013).
[26] Cong, Y. et al. SatMAE: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems 35, 197–211 (2022).
[27] Siméoni, O. et al. DINOv3. arXiv preprint arXiv:2508.10104 (2025).
[28] Stewart, A. et al. SSL4EO-L: Datasets and foundation models for Landsat imagery. Advances in Neural Information Processing Systems 36, 59787–59807 (2023).
[29] Xiong, Z. et al. Neural plasticity-inspired multimodal foundation model for earth observation. arXiv preprint arXiv:2403.15356 (2024).
[30] Szwarcman, D. et al. Prithvi-EO-2.0: A versatile multi-temporal foundation model for earth observation applications. IEEE Transactions on Geoscience and Remote Sensing (2025).
[31] Clay Foundation. Clay foundation model (2024). URL https://github.com/Clay-foundation/model. GitHub repository.
[32] Hu, E. J. et al. LoRA: Low-rank adaptation of large language models. ICLR 1, 3 (2022).
[33] Patel, D., Sandefur, J. & Subramanian, A. We were wrong about convergence. ChatGDP Blog (2025). URL https://www.chat-gdp.org/we-were-wrong-about-convergence/. Accessed: 2026-04-22.
[34] Filmer, D. & Pritchett, L. H. Estimating wealth effects without expenditure data—or tears: an application to educational enrollments in states of India. Demography 38, 115–132 (2001).
[35] Sahn, D. E. & Stifel, D. Exploring alternative measures of welfare in the absence of expenditure data. Review of Income and Wealth 49, 463–489 (2003).
[36] Colston, J. M. et al. Spatial variation in housing construction material in low- and middle-income countries: A Bayesian spatial prediction model of a key infectious diseases risk factor and social determinant of health. PLOS Global Public Health 4, e0003338 (2024).
[37] Marshall, M. G. & Gurr, T. R. Polity5: Political Regime Characteristics and Transitions, 1800–2018. Center for Systemic Peace, Vienna, VA (2020). URL https://www.systemicpeace.org/inscrdata.html. Dataset Users’ Manual.
[38] Sundberg, R. & Melander, E. Introducing the UCDP georeferenced event dataset. Journal of Peace Research 50, 523–532 (2013).
[39] Hersbach, H. et al. The ERA5 global reanalysis. Quarterly Journal of the Royal Meteorological Society 146, 1999–2049 (2020).
[40] Caron, M. et al. Emerging properties in self-supervised vision transformers. Proceedings of the IEEE/CVF International Conference on Computer Vision 9650–9660 (2021).
[41] Zhou, J. et al. iBOT: Image BERT pre-training with online tokenizer. arXiv preprint arXiv:2111.07832 (2021).
[42] Assran, M. et al. Self-supervised learning from images with a joint-embedding predictive architecture. Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition 15619–15629 (2023).
[43] Sablayrolles, A., Douze, M., Schmid, C. & Jégou, H. Spreading vectors for similarity search. arXiv preprint arXiv:1806.03198 (2018).
[44] Loshchilov, I. & Hutter, F. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101 (2017).
[45] Dao, T. FlashAttention-2: Faster attention with better parallelism and work partitioning. arXiv preprint arXiv:2307.08691 (2023).
[46] Zhao, Y. et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277 (2023).
[47] Huang, G., Sun, Y., Liu, Z., Sedra, D. & Weinberger, K. Q. Deep networks with stochastic depth. European Conference on Computer Vision 646–661 (2016).
[48] Nix, D. A. & Weigend, A. S. Estimating the mean and variance of the target probability distribution. Proceedings of 1994 IEEE International Conference on Neural Networks (ICNN’94) 1, 55–60 (1994).
[49] Hunter, J. D. Matplotlib: A 2D graphics environment. Computing in Science & Engineering 9, 90–95 (2007).
[50] Jordahl, K. et al. geopandas/geopandas: v0.8.1. Zenodo (2020). URL https://doi.org/10.5281/zenodo.3946761
[51] Ba, J. L., Kiros, J. R. & Hinton, G. E. Layer normalization. arXiv preprint arXiv:1607.06450 (2016).
[52] Vaswani, A. et al. Attention is all you need. Advances in Neural Information Processing Systems 30 (2017).
[53] Hendrycks, D. & Gimpel, K. Gaussian error linear units (GELUs). arXiv preprint arXiv:1606.08415 (2016).
[54] Su, J. et al. RoFormer: Enhanced transformer with rotary position embedding. Neurocomputing 568, 127063 (2024).

Methods

Pretraining satellite imagery

To enable Tempov to learn temporally invariant representations that are robust to phenological variation, we curated a large-scale bi-temporal Landsat dataset from the SSL4EO-L archive [28]. The dataset compris...
[55]

$\log p_{\theta_s}(v)$. (4)

where $V_s = \{x^g_2\} \cup V^{\ell}_t$. Here, $p_\theta(\cdot) = h^{\mathrm{DINO}}_\theta(g_\theta(\cdot))$ denotes the class-token probability distribution produced by the Tempov backbone $g_\theta$ followed by a DINO head $h^{\mathrm{DINO}}_\theta$ [40]. Accordingly, $L_{\text{bi-DINO}}$ is the cross-entropy loss from the teacher prediction on $x^g_1$ to the student predictions over all target views in $V_s$, enforcing temporal consisten...
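The excerpt describes the bi-temporal DINO term as a cross-entropy from the teacher's prediction on the first-date global view to the student's predictions over a set of target views. A minimal sketch of that structure follows. The full equation is truncated on this page, so the plain sum over views, the absence of temperatures and centering, and all names are assumptions, not the paper's exact loss.

```python
import math
from typing import List, Sequence


def softmax(logits: Sequence[float]) -> List[float]:
    """Numerically stable softmax over class logits."""
    shift = max(logits)
    exps = [math.exp(x - shift) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(teacher_probs: Sequence[float], student_probs: Sequence[float]) -> float:
    """H(teacher, student) = -sum_k teacher_k * log(student_k)."""
    return -sum(t * math.log(s) for t, s in zip(teacher_probs, student_probs))


def bi_temporal_ce(
    teacher_logits_t1: Sequence[float],
    student_logits_views: Sequence[Sequence[float]],
) -> float:
    """Sum the teacher-to-student cross-entropy over all target views.

    teacher_logits_t1: teacher head output for the date-1 global view.
    student_logits_views: student head outputs for each target view
    (for example the date-2 global view plus local crops).
    """
    teacher_probs = softmax(teacher_logits_t1)
    return sum(cross_entropy(teacher_probs, softmax(v)) for v in student_logits_views)


# Toy two-class logits (assumed values).
teacher_t1 = [0.0, 0.0]
views_agree = [[0.0, 0.0], [0.0, 0.0]]
views_disagree = [[3.0, -3.0], [0.0, 0.0]]
loss_agree = bi_temporal_ce(teacher_t1, views_agree)
loss_disagree = bi_temporal_ce(teacher_t1, views_disagree)
```

The loss is lowest when the student's distribution on the second-date views matches the teacher's on the first date, which is how such an objective would push features to stay stable across acquisition times.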
[56]

$\log p'_{\theta_s}(x^g_2)$. (5)

where $p'_\theta(\cdot) = h^{\mathrm{iBOT}}_\theta(g_\theta(\cdot))$ denotes the patch-token probability distribution produced by the Tempov backbone $g_\theta$ followed by an iBOT head $h^{\mathrm{iBOT}}_\theta$ [41]. In this objective, the student predicts masked patch tokens to match the teacher’s unmasked outputs across seasonal views, encouraging temporally invariant representations and lear...
[57]

retrieves

to evaluate five generalization scenarios. Data are grouped by country and year, then split into spatially disjoint folds within each country-year entry (illustrated in Supplementary Fig. 1). Let $C = \{c_1, c_2, \ldots, c_n\}$ denote the country set ($n = 34$), with target country $A = \{c_i\}$, other countries $B = C - A$, target year $T_1$, and all other years $T_2$. We evaluate...