Better Together: Evaluating the Complementarity of Earth Embedding Models
Pith reviewed 2026-05-20 11:47 UTC · model grok-4.3
The pith
Fusing multiple Earth embedding models improves performance over the best single model on four of six downstream tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
When four Earth embedding models are fused, the resulting embeddings outperform the best single model in four out of six downstream tasks. The performance gain, termed complementarity, is both task-dependent and location-dependent. In a land-cover regression task, the degree of complementarity is partly explained by the spatial scale of the land-cover classes.
What carries the argument
The embedding complementarity index, which measures the task-performance gain obtained by fusing two or more embeddings over the strongest single embedding.
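A minimal sketch of how such an index could be computed, assuming it is the plain difference between the fused score and the best single-model score. The function name, the `higher_is_better` flag, and the example numbers are illustrative assumptions, not the paper's exact formula or results.

```python
from typing import Mapping


def complementarity_index(
    single_scores: Mapping[str, float],
    fused_score: float,
    higher_is_better: bool = True,
) -> float:
    """Gain of the fused embedding over the strongest single embedding.

    Assumption: the index is a plain score difference; the paper's exact
    formula is not reproduced here.
    """
    if not single_scores:
        raise ValueError("need at least one single-model score")
    if higher_is_better:
        best_single = max(single_scores.values())
        return fused_score - best_single
    # For error metrics (e.g. RMSE) the best single model has the lowest
    # score, and a positive index still means that fusion helped.
    best_single = min(single_scores.values())
    return best_single - fused_score


# Hypothetical accuracies for one task (illustrative numbers only).
singles = {"model_a": 0.71, "model_b": 0.68}
print(complementarity_index(singles, fused_score=0.75))
```

Under this reading, a positive value means fusion beat the strongest single embedding, and zero or a negative value means it did not.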
If this is right
- Single-embedding evaluations underestimate the overall capabilities of the Earth embedding family.
- Complementarity strength differs across downstream tasks.
- Complementarity also differs across geographic locations.
- For land-cover regression the spatial scale of classes helps predict how much fusion helps.
- The largest future gains may come from combinations rather than from any single improved model.
Where Pith is reading between the lines
- Model developers could aim to train embeddings that deliberately fill gaps left by existing ones instead of competing only on overall accuracy.
- Benchmarks could routinely include fused results so that reported numbers reflect the best achievable performance from the available set.
- Task designers might choose locations or class scales where complementarity is known to be high in order to maximize gains from fusion.
Load-bearing premise
The fusion step used to join embeddings is a neutral operation that reveals true complementarity rather than creating its own performance artifacts.
What would settle it
A replication on the same tasks and models that applies multiple distinct fusion methods and finds that the performance advantage over single models disappears under every method.
Original abstract
Earth embedding models transform Earth observation data into embeddings uniquely tied to locations on the Earth's surface. These models are typically evaluated in isolation, comparing the downstream task performance across different Earth embeddings. However, spatially aligned embeddings can naturally be fused, providing richer information per location, a capability that isolated evaluations fail to capture. We therefore propose assessing Earth embeddings by their complementarity: the performance gain of fused embeddings over the best single-model baseline. To operationalise this, we introduce an embedding complementarity index applicable to any embedding and task, and evaluate four Earth embedding models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, in all pairs, and jointly across six downstream tasks. Fused embeddings outperform the best single model in four out of six tasks, confirming that single-embedding evaluations often underestimate Earth embedding capabilities. Complementarity proves both task- and location-dependent. Further, for a land cover regression task, we find that complementarity is partially determined by the spatial scale of land cover classes. Complementarity reframes Earth embeddings: the greatest future gains may come not from any single Earth embedding model, but from combinations that are better together.
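The evaluation protocol in the abstract (each model in isolation, all pairs, and all four jointly) can be sketched as follows. This assumes fusion is plain per-location concatenation, which the referee report suggests but the abstract does not state. The helper names `evaluation_sets` and `fuse` are hypothetical.

```python
from itertools import combinations

MODELS = ["AlphaEarth", "Tessera", "GeoCLIP", "SatCLIP"]


def evaluation_sets(models):
    """Return the single, pairwise, and joint model sets to evaluate."""
    singles = [(m,) for m in models]
    pairs = list(combinations(models, 2))
    joint = [tuple(models)]
    return singles + pairs + joint


def fuse(embeddings, model_set):
    """Concatenate the per-location vectors of the chosen models.

    `embeddings` maps model name -> list of per-location vectors. All lists
    are assumed to be aligned to the same locations; only lengths are checked.
    """
    n_locations = {len(embeddings[m]) for m in model_set}
    if len(n_locations) != 1:
        raise ValueError("embeddings must cover the same locations")
    fused = []
    for i in range(n_locations.pop()):
        row = []
        for m in model_set:
            row.extend(embeddings[m][i])
        fused.append(row)
    return fused


for model_set in evaluation_sets(MODELS):
    print(" + ".join(model_set))
```

With four models this gives 4 singles, 6 pairs, and 1 joint set, so 11 configurations per task.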
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper argues that Earth embedding models should be assessed via their complementarity when fused rather than evaluated in isolation. It introduces an embedding complementarity index, evaluates four models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, pairs, and jointly on six downstream tasks, and reports that fused embeddings outperform the best single-model baseline in four out of six tasks, with complementarity being task- and location-dependent.
Significance. If the empirical results hold after controlling for dimensionality and providing statistical details, the work offers a constructive reframing of embedding evaluation in Earth observation, suggesting that future progress may lie in model combinations. The complementarity index is a potentially reusable contribution, though its parameter-free status and general applicability require clearer demonstration.
major comments (2)
- [Experimental evaluation] Evaluation protocol (described in the abstract and experimental sections): the direct comparison of fused (concatenated) embeddings against single-model baselines lacks dimension-matched controls such as PCA-reduced fusions or random-feature baselines of equivalent dimensionality. This leaves open the possibility that reported gains in 4/6 tasks arise from increased input capacity rather than genuine complementarity, which is load-bearing for the central claim.
- [Results] Results reporting: the manuscript states clear outperformance in four out of six tasks but provides no error bars, statistical significance tests, or details on data splits and fusion implementation. Without these, the reliability of the 4/6 count and the complementarity index cannot be verified.
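One way to build the dimension-matched control raised in the first major comment is to project the fused embedding down to the dimensionality of the strongest single model with a fixed Gaussian random matrix. The sketch below is an assumed reading of "random-feature baselines of equivalent dimensionality", not the authors' implementation. The function name and parameters are hypothetical.

```python
import random


def random_projection(rows, target_dim, seed=0):
    """Project each row to `target_dim` dimensions with a Gaussian matrix."""
    if not rows:
        return []
    source_dim = len(rows[0])
    rng = random.Random(seed)
    scale = 1.0 / target_dim ** 0.5
    # One fixed matrix is shared by every location so rows stay comparable.
    matrix = [
        [rng.gauss(0.0, 1.0) * scale for _ in range(target_dim)]
        for _ in range(source_dim)
    ]
    return [
        [
            sum(x * matrix[i][j] for i, x in enumerate(row))
            for j in range(target_dim)
        ]
        for row in rows
    ]


# Toy fused embedding: 2 locations x 4 dimensions, reduced to 2 dimensions.
fused = [[1.0, 2.0, 3.0, 4.0], [0.5, 0.0, -1.0, 2.0]]
print(random_projection(fused, target_dim=2))
```

If the projected fused embedding still beats the best single model at equal dimensionality, added input capacity alone becomes a less plausible explanation for the gain.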
minor comments (2)
- [Methods] The definition and computation of the embedding complementarity index should be given explicitly with a formula or pseudocode to allow reproduction.
- [Methods] Clarify whether fusion is simple concatenation or includes any normalization/alignment steps, and discuss sensitivity to these choices.
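To make the normalization question concrete, the sketch below standardizes each embedding block per dimension before concatenating. It is one of several plausible choices and is not confirmed as the paper's procedure. The function names are hypothetical.

```python
from statistics import mean, pstdev


def standardize(rows):
    """Z-score each dimension across locations; constant dimensions become 0."""
    columns = list(zip(*rows))
    means = [mean(c) for c in columns]
    stds = [pstdev(c) for c in columns]
    return [
        [(x - m) / s if s > 0 else 0.0 for x, m, s in zip(row, means, stds)]
        for row in rows
    ]


def fuse_normalized(blocks):
    """Standardize each embedding block separately, then concatenate per location."""
    scaled = [standardize(block) for block in blocks]
    return [[x for part in parts for x in part] for parts in zip(*scaled)]


# Two toy blocks for the same two locations, on very different scales.
block_a = [[1.0], [3.0]]
block_b = [[10.0, 0.0], [20.0, 0.0]]
print(fuse_normalized([block_a, block_b]))
```

In a real evaluation the means and standard deviations would be estimated on the training split only and reused for the test split. The sketch omits this for brevity.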
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments, which help strengthen the rigor of our evaluation of Earth embedding complementarity. We respond to each major comment below and will revise the manuscript accordingly.
- Referee: [Experimental evaluation] Evaluation protocol (described in the abstract and experimental sections): the direct comparison of fused (concatenated) embeddings against single-model baselines lacks dimension-matched controls such as PCA-reduced fusions or random-feature baselines of equivalent dimensionality. This leaves open the possibility that reported gains in 4/6 tasks arise from increased input capacity rather than genuine complementarity, which is load-bearing for the central claim.
  Authors: We agree that dimension-matched controls are essential to isolate genuine complementarity from the effects of increased feature dimensionality. Although our primary experiments demonstrate the value of direct fusion, the revised manuscript will include additional baselines: (1) PCA-reduced versions of the concatenated embeddings projected to match the dimensionality of the strongest single-model baseline, and (2) random projection baselines of equivalent dimensionality. These controls will be reported for all tasks to confirm that observed gains in four of six tasks arise from complementary information rather than capacity alone.
  Revision: yes
- Referee: [Results] Results reporting: the manuscript states clear outperformance in four out of six tasks but provides no error bars, statistical significance tests, or details on data splits and fusion implementation. Without these, the reliability of the 4/6 count and the complementarity index cannot be verified.
  Authors: We appreciate this feedback on reproducibility. The revised manuscript will expand the experimental section to include: error bars (standard deviation across repeated runs or cross-validation folds), statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing fused versus best single-model performance, and detailed descriptions of data splits, preprocessing steps prior to fusion, the exact concatenation procedure, and implementation details (including any normalization or handling of missing values). These additions will allow independent verification of the 4/6 outperformance count and the complementarity index.
  Revision: yes
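The promised error bars and paired comparison could look like the sketch below. It reports the mean and standard deviation of per-fold gains and, as a stdlib-only stand-in for the paired t-test or Wilcoxon test named in the response, an exact sign-flip permutation p-value. The function names and fold scores are hypothetical.

```python
from itertools import product
from statistics import mean, stdev


def paired_summary(fused_scores, single_scores):
    """Mean and sample standard deviation of per-fold gains (fused - single)."""
    if len(fused_scores) != len(single_scores):
        raise ValueError("scores must be paired fold by fold")
    gains = [f - s for f, s in zip(fused_scores, single_scores)]
    return mean(gains), stdev(gains)


def sign_flip_p_value(fused_scores, single_scores):
    """Exact two-sided sign-flip permutation p-value for the summed gain."""
    gains = [f - s for f, s in zip(fused_scores, single_scores)]
    observed = abs(sum(gains))
    count = 0
    total = 0
    # Enumerate every assignment of +/- signs to the paired differences.
    for signs in product((1, -1), repeat=len(gains)):
        total += 1
        flipped = abs(sum(sign * g for sign, g in zip(signs, gains)))
        if flipped >= observed - 1e-12:
            count += 1
    return count / total


# Hypothetical scores from five cross-validation folds.
fused = [0.80, 0.82, 0.79, 0.81, 0.83]
best_single = [0.75, 0.76, 0.77, 0.74, 0.78]
print(paired_summary(fused, best_single))
print(sign_flip_p_value(fused, best_single))
```

The enumeration grows as 2^n in the number of folds, so it suits the handful of folds typical of cross-validation. For more repetitions, a sampled permutation test or a library implementation of the Wilcoxon signed-rank test would be the usual choice.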
Circularity Check
No circularity: direct empirical comparisons of fused vs. single embeddings
The paper's derivation consists of an empirical protocol that measures downstream task performance for single embeddings, all pairwise fusions, and the joint fusion across six tasks, then reports the gain of the fused case over the best single baseline. The complementarity index is defined directly as this observed performance difference and is computed on independent task evaluations; no equations reduce the reported gains to fitted parameters defined from the same data, no self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming of prior results is invoked to derive the central claim. The evaluation is self-contained and replicable from the stated experimental setup.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The six downstream tasks are representative of real-world uses of Earth embeddings.
invented entities (1)
- Embedding complementarity index: no independent evidence
Reference graph
Works this paper leans on
-
[1]
URL https://arxiv.org/abs/2204.01678. James GC Ball, Jana Annika Wicklein, Zhengpeng Feng, Jovana Knezevic, Sadiq Jaffer, Anil Madhavapeddy, Clement Atzberger, Michele Dalponte, and David Coomes. Geospatial foundation models enable data-efficient tree species mapping in temperate mountain forests. bioRxiv,
-
[2]
doi: 10.64898/2026.02.23.707022. Christopher F Brown, Steven P Brumby, Brookie Guzder-Williams, Tanya Birch, Samantha Brooks Hyde, Joseph Mazzariello, Wanda Czerwinski, Valerie J Pasquarella, Robert Haertel, Simon Ilyushchenko, et al. Dynamic world, near real-time global 10 m land use land cover mapping. Scientific Data, 9(1):251,
-
[3]
Christopher F Brown, Michal R Kazmierski, Valerie J Pasquarella, William J Rucklidge, Masha Samsikova, Chenhui Zhang, Evan Shelhamer, Estefania Lahera, Olivia Wiles, Simon Ilyushchenko, et al. AlphaEarth foundations: An embedding field model for accurate and efficient global mapping from sparse label data. arXiv preprint arXiv:2507.22291,
-
[4]
ISSN 2374-0353. doi: 10.1145/3787217. URL https://doi.org/10.1145/3787217. Aayush Dhakal, Srikumar Sastry, Subash Khanal, Adeel Ahmad, Eric Xing, and Nathan Jacobs. Range: Retrieval augmented neural fields for multi-resolution geo-embeddings. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 24680–24689, June
-
[5]
Heng Fang, Adam J Stewart, Isaac Corley, Xiao Xiang Zhu, and Hossein Azizpour. Earth embeddings as products: Taxonomy, ecosystem, and standardized access. arXiv preprint arXiv:2601.13134,
-
[6]
TESSERA: Temporal Embeddings of Surface Spectra for Earth Representation and Analysis
Zhengpeng Feng, Clement Atzberger, Sadiq Jaffer, Jovana Knezevic, Silja Sormunen, Robin Young, Madeline C Lisaius, Markus Immitzer, Toby Jackson, James Ball, et al. TESSERA: Temporal embeddings of surface spectra for earth representation and analysis. arXiv preprint arXiv:2506.20380,
-
[7]
Lucia Gordon, Serge Belongie, Christian Igel, and Nico Lang. Mmearth-bench: Global model adaptation via multimodal test-time training. arXiv preprint arXiv:2602.06285,
-
[8]
Fabian Gröger, Shuo Wen, and Maria Brbić. Revisiting the platonic representation hypothesis: An aristotelian view. arXiv preprint arXiv:2602.14486,
-
[9]
doi: 10.1101/2025.09.06.674602. Henry Herzog, Favyen Bastani, Yawen Zhang, Gabriel Tseng, Joseph Redmon, Hadrien Sablon, Ryan Park, Jacob Morrison, Alexandra Buraczynski, Karen Farley, et al. Olmoearth: Stable latent image modeling for multimodal earth observation. arXiv preprint arXiv:2511.13655,
-
[10]
The Platonic Representation Hypothesis
URL https://arxiv.org/abs/2405.07987. Johannes Jakubik, Felix Yang, Benedikt Blumenstiel, Erik Scheurer, Rocco Sedona, Stefano Maurogiovanni, Jente Bosmans, Nikolaos Dionelis, Valerio Marsocci, Niklas Kopp, et al. Terramind: Large-scale generative multimodality for earth observation. In Proceedings of the IEEE/CVF International Conference on Computer Visio...
-
[11]
Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm
doi: 10.1080/15481603.2025.2594797. Konstantin Klemmer, Esther Rolf, Caleb Robinson, Lester Mackey, and Marc Rußwurm. Satclip: Global, general-purpose location embeddings with satellite imagery. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 4347–4355, 2025a. ...
-
[12]
Madeline C Lisaius, Srinivasan Keshav, Andrew Blake, and Clement Atzberger
URL https://proceedings.mlr.press/v97/kornblith19a.html. Madeline C Lisaius, Srinivasan Keshav, Andrew Blake, and Clement Atzberger. Embedding-based crop type classification in the groundnut basin of Senegal. arXiv preprint arXiv:2601.16900,
-
[13]
Yuchi Ma, Yawen Shen, Anu Swatantran, and David B Lobell
doi: 10.1038/s42256-025-01116-5. Yuchi Ma, Yawen Shen, Anu Swatantran, and David B Lobell. Harvesting alphaearth: Benchmarking the geospatial foundation model for agricultural downstream tasks. International Journal of Applied Earth Observation and Geoinformation, 149:105258,
-
[14]
Daniel Panangian and Ksenia Bittner. Can location embeddings enhance super-resolution of satellite imagery? 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV), pages 6136–6145,
-
[15]
In 2025 IEEE/CVF Winter Conference on Applications of Computer Vision (WACV)
doi: 10.1109/wacv61041.2025.00598. Oona Rainio, Jarmo Teuho, and Riku Klén. Evaluation metrics and statistical tests for machine learning. Scientific Reports, 14(1):6086,
-
[16]
Using multiple input modalities can improve data-efficiency and o.o.d
Arjun Rao and Esther Rolf. Using multiple input modalities can improve data-efficiency and o.o.d. generalization for ml with satellite imagery. ArXiv, abs/2507.13385,
-
[17]
Using multiple input modalities can improve data-efficiency and o.o.d
doi: 10.48550/arxiv.2507.13385. Gabriel Tseng, Ivan Zvonkov, Catherine Lilian Nakalembe, and Hannah Kerner. Cropharvest: A global dataset for crop-type classification. In Thirty-fifth Conference on Neural Information Processing Systems Datasets and Benchmarks Track (Round 2),
-
[18]
Lightweight, pre-trained transformers for remote sensing timeseries,
URL https://openreview.net/forum?id=JtjzUXPEaCu. Gabriel Tseng, Ruben Cartuyvels, Ivan Zvonkov, Mirali Purohit, David Rolnick, and Hannah Kerner. Lightweight, pre-trained transformers for remote sensing timeseries. arXiv preprint arXiv:2304.14065,
-
[19]
NeurEO: Dissecting earth embeddings using computational neuroscience
Thijs L Van der Plas, Jacob JW Bakermans, and Ioannis N Athanasiadis. NeurEO: Dissecting earth embeddings using computational neuroscience. In EurIPS REO workshop 2025: The 1st workshop on Advances on Representation Learning for Earth Observation,
-
[20]
Below-ground Fungal Biodiversity Can be Monitored Using Self-Supervised Learning Satellite Features
Robin Young, Michael E Van Nuland, E Toby Kiers, Tomáš Větrovský, Petr Kohout, Petr Baldrian, and Srinivasan Keshav. Below-ground fungal biodiversity can be monitored using self-supervised learning satellite features. arXiv preprint arXiv:2604.09818,
-
[21]
Cropland mapping using geospatial embeddings
Ivan Zvonkov, Gabriel Tseng, Inbal Becker-Reshef, and Hannah Kerner. Cropland mapping using geospatial embeddings. arXiv preprint arXiv:2511.02923,
-
[22]
Crop type: Crop type classification task of 18 classes, derived from the CropHarvest dataset [Tseng et al., 2021]. We used the global CropHarvest dataset and selected all classes with ≥200 samples, yielding 18 out of 306 classes. We then randomly subsampled all 18 classes to 200 samples each to obtain a balanced dataset of 3,600 locations.
Biomass: Biomass pr...
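The crop type subsampling described above (keep classes with at least 200 samples, then draw 200 from each) can be sketched as follows. The function name, its parameters, and the toy labels are hypothetical, and the seed handling is an assumption, since the excerpt does not say how the random draw was made.

```python
import random
from collections import defaultdict


def balanced_subsample(labels, min_count=200, per_class=200, seed=0):
    """Return sorted sample indices for a class-balanced subset.

    Classes with fewer than `min_count` samples are dropped. Every remaining
    class contributes exactly `per_class` randomly chosen samples.
    """
    if per_class > min_count:
        raise ValueError("per_class must not exceed min_count")
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    rng = random.Random(seed)
    chosen = []
    for label in sorted(by_class):
        indices = by_class[label]
        if len(indices) >= min_count:
            chosen.extend(rng.sample(indices, per_class))
    return sorted(chosen)


# Toy labels with small thresholds so the example runs instantly.
toy_labels = ["maize"] * 5 + ["rice"] * 2 + ["wheat"] * 3
print(balanced_subsample(toy_labels, min_count=3, per_class=3))
```

With the defaults and the counts quoted above, 18 retained classes at 200 samples each give the stated 3,600 locations.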