
arxiv: 2605.18667 · v1 · pith:D7IH7Q3Jnew · submitted 2026-05-18 · 💻 cs.CV · cs.LG

Better Together: Evaluating the Complementarity of Earth Embedding Models

Pith reviewed 2026-05-20 11:47 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords Earth embeddings · embedding fusion · complementarity · remote sensing · downstream tasks · land cover · spatial scale · model evaluation

The pith

Fusing multiple Earth embedding models improves performance over the best single model on four of six downstream tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Earth embedding models turn satellite and ground data into vectors that represent specific locations on the planet. Standard evaluations test each model alone and declare a winner for a given task. This paper instead combines the vectors from different models and measures whether the combination beats the strongest individual model. On six tasks using four models, the fused versions win in four cases. The amount of extra performance varies with the task, the location, and for land-cover work with the physical size of the features being mapped. The result suggests that future progress may lie more in selecting good combinations than in perfecting any one model by itself.

Core claim

When four Earth embedding models are fused, the resulting embeddings outperform the best single model in four out of six downstream tasks. The performance gain, termed complementarity, is both task-dependent and location-dependent. In a land-cover regression task the degree of complementarity is partly explained by the spatial scale of the land-cover classes.

What carries the argument

The embedding complementarity index, which measures the task-performance gain obtained by fusing two or more embeddings over the strongest single embedding.
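
The index is described here only in words, so the sketch below shows one way it could be computed, assuming the plain difference form (fused score minus the strongest single-model score). The function name `complementarity_gain`, the `higher_is_better` flag, and the example scores are illustrative assumptions, not taken from the paper, whose index may be normalised differently.

```python
def complementarity_gain(fused_score, single_scores, higher_is_better=True):
    """Gain of a fused embedding over the strongest single embedding.

    Assumed form: a plain difference of task scores. The paper's exact
    index may be normalised differently.
    """
    if not single_scores:
        raise ValueError("need at least one single-model score")
    if higher_is_better:
        best_single = max(single_scores.values())
        return fused_score - best_single
    # For error metrics such as MSE, a lower fused error is a positive gain.
    best_single = min(single_scores.values())
    return best_single - fused_score


# Illustrative, made-up numbers (not results from the paper).
scores = {"model_a": 0.70, "model_b": 0.50}
gain = complementarity_gain(0.75, scores)
```

A positive value means the fused embedding beat the best single model on that task; zero or negative means fusion did not help.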

If this is right

  • Single-embedding evaluations underestimate overall capabilities of the Earth embedding family.
  • Complementarity strength differs across downstream tasks.
  • Complementarity also differs across geographic locations.
  • For land-cover regression the spatial scale of classes helps predict how much fusion helps.
  • The largest future gains may come from combinations rather than from any single improved model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Model developers could aim to train embeddings that deliberately fill gaps left by existing ones instead of competing only on overall accuracy.
  • Benchmarks could routinely include fused results so that reported numbers reflect the best achievable performance from the available set.
  • Task designers might choose locations or class scales where complementarity is known to be high in order to maximize gains from fusion.

Load-bearing premise

The fusion step used to join embeddings is a neutral operation that reveals true complementarity rather than creating its own performance artifacts.

What would settle it

A replication on the same tasks and models that applies multiple distinct fusion methods and finds that the performance advantage over single models disappears under every method.

Figures

Figures reproduced from arXiv: 2605.18667 by Gabrielė Tijūnaitytė, Ioannis N Athanasiadis, Jacob JW Bakermans, Marc Rußwurm, Thijs L van der Plas, Vishal Nedungadi.

Figure 1. Embeddings are partially similar and partially complementary. a) CCA and CKA similarity values for each pairwise combination of four Earth embeddings indicate that all embedding combinations are substantially, but not completely, similar. b) Embedding complementarity was assessed by comparing their fused downstream task performance (green) against their individual task performances (blue, yellow). In this …
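
As a rough illustration of the similarity analysis in panel a), linear CKA between two spatially aligned embedding matrices can be sketched as follows. This uses the standard linear CKA formula on column-centred features. Whether the paper uses the linear or a kernel variant is not stated here, CCA is not covered, and the arrays are synthetic stand-ins rather than real embeddings.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between two embedding matrices with matching rows.

    x: (n_locations, d1), y: (n_locations, d2). Rows must refer to the
    same locations in the same order.
    """
    x = x - x.mean(axis=0, keepdims=True)
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


rng = np.random.default_rng(0)
emb_a = rng.normal(size=(200, 16))              # synthetic stand-in embedding
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))  # random orthogonal matrix
emb_rotated = 3.0 * emb_a @ q                   # rotated and rescaled copy
emb_other = rng.normal(size=(200, 8))           # unrelated random embedding

cka_same = linear_cka(emb_a, emb_rotated)
cka_other = linear_cka(emb_a, emb_other)
```

Linear CKA is invariant to rotation and isotropic scaling, so the rotated copy scores close to 1 while the unrelated embedding scores much lower. The two matrices may have different widths as long as their rows are aligned.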
Figure 2. Embedding complementarity varies spatially. Complementarity was computed per point for AlphaEarth and Tessera, for the bioclimatic (top) and land cover (bottom) tasks, using the MSE per point. Positive values indicate positive complementarity, and negative values were clipped at −1.
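
The caption does not give the per-point formula. One form that would be consistent with an upper bound of 1 and a need to clip large negative values is the relative error reduction at each location, sketched below. The formula, the function name `pointwise_complementarity`, and the example errors are all assumptions made for illustration.

```python
def pointwise_complementarity(err_best_single, err_fused, floor=-1.0, eps=1e-12):
    """Per-location relative error reduction, clipped from below.

    Assumed form: (e_single - e_fused) / e_single, where both inputs are
    per-point squared errors. Positive means fusion reduced the error.
    """
    values = []
    for e_single, e_fused in zip(err_best_single, err_fused):
        relative = (e_single - e_fused) / max(e_single, eps)
        values.append(max(relative, floor))
    return values


# Made-up per-point squared errors at three locations.
single = [0.40, 0.10, 0.20]
fused = [0.20, 0.50, 0.20]
result = pointwise_complementarity(single, fused)
```

Under this assumed form the first location shows a gain, the second a loss large enough to hit the floor of −1, and the third no change.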
Original abstract

Earth embedding models transform Earth observation data into embeddings uniquely tied to locations on the Earth's surface. These models are typically evaluated in isolation, comparing the downstream task performance across different Earth embeddings. However, spatially aligned embeddings can naturally be fused, providing richer information per location, a capability that isolated evaluations fail to capture. We therefore propose assessing Earth embeddings by their complementarity: the performance gain of fused embeddings over the best single-model baseline. To operationalise this, we introduce an embedding complementarity index applicable to any embedding and task, and evaluate four Earth embedding models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, in all pairs, and jointly across six downstream tasks. Fused embeddings outperform the best single model in four out of six tasks, confirming that single-embedding evaluations often underestimate Earth embedding capabilities. Complementarity proves both task- and location-dependent. Further, for a land cover regression task, we find that complementarity is partially determined by the spatial scale of land cover classes. Complementarity reframes Earth embeddings: the greatest future gains may come not from any single Earth embedding model, but from combinations that are better together.
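
The evaluation grid described in the abstract (each model alone, every pair, and all four jointly) can be enumerated with a short sketch. The function name `fusion_configurations` is hypothetical, and triples are left out only because the abstract does not mention them.

```python
from itertools import combinations

MODELS = ["AlphaEarth", "Tessera", "GeoCLIP", "SatCLIP"]


def fusion_configurations(models):
    """Return singles, all pairs, and the joint set, in that order."""
    singles = [(m,) for m in models]
    pairs = list(combinations(models, 2))
    joint = [tuple(models)]
    return singles + pairs + joint


configs = fusion_configurations(MODELS)
for config in configs:
    print(" + ".join(config))
```

For four models this gives 4 singles, 6 pairs, and 1 joint configuration, so 11 configurations per downstream task.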

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper argues that Earth embedding models should be assessed via their complementarity when fused rather than evaluated in isolation. It introduces an embedding complementarity index, evaluates four models (AlphaEarth, Tessera, GeoCLIP, SatCLIP) in isolation, pairs, and jointly on six downstream tasks, and reports that fused embeddings outperform the best single-model baseline in four out of six tasks, with complementarity being task- and location-dependent.

Significance. If the empirical results hold after controlling for dimensionality and providing statistical details, the work offers a constructive reframing of embedding evaluation in Earth observation, suggesting that future progress may lie in model combinations. The complementarity index is a potentially reusable contribution, though its parameter-free status and general applicability require clearer demonstration.

major comments (2)
  1. [Experimental evaluation] Evaluation protocol (described in the abstract and experimental sections): the direct comparison of fused (concatenated) embeddings against single-model baselines lacks dimension-matched controls such as PCA-reduced fusions or random-feature baselines of equivalent dimensionality. This leaves open the possibility that reported gains in 4/6 tasks arise from increased input capacity rather than genuine complementarity, which is load-bearing for the central claim.
  2. [Results] Results reporting: the manuscript states clear outperformance in four out of six tasks but provides no error bars, statistical significance tests, or details on data splits and fusion implementation. Without these, the reliability of the 4/6 count and the complementarity index cannot be verified.
minor comments (2)
  1. [Methods] The definition and computation of the embedding complementarity index should be given explicitly with a formula or pseudocode to allow reproduction.
  2. [Methods] Clarify whether fusion is simple concatenation or includes any normalization/alignment steps, and discuss sensitivity to these choices.
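
To make the requested controls concrete, the sketch below shows concatenation fusion with per-feature standardisation, followed by a PCA projection of the fused matrix down to the width of a single embedding. Everything here is an assumption for illustration: the paper's actual fusion and normalisation steps are exactly what the referee asks to have clarified, the helper names are hypothetical, and the arrays are synthetic with made-up sizes.

```python
import numpy as np


def standardize(x):
    """Zero-mean, unit-variance scaling per feature (an assumed preprocessing step)."""
    std = x.std(axis=0, keepdims=True)
    return (x - x.mean(axis=0, keepdims=True)) / np.where(std > 0, std, 1.0)


def fuse_concat(embeddings):
    """Concatenate spatially aligned embeddings along the feature axis."""
    return np.concatenate([standardize(e) for e in embeddings], axis=1)


def pca_reduce(x, n_components):
    """Project x onto its top principal components via SVD."""
    centered = x - x.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return centered @ vt[:n_components].T


rng = np.random.default_rng(0)
emb_a = rng.normal(size=(500, 64))   # synthetic stand-in, hypothetical sizes
emb_b = rng.normal(size=(500, 128))

fused = fuse_concat([emb_a, emb_b])          # shape (500, 192)
matched = pca_reduce(fused, emb_a.shape[1])  # shape (500, 64)
```

If the dimension-matched `matched` features still beat the single-model baseline on a task, the gain is harder to attribute to input capacity alone. In a real evaluation the scaling statistics and the PCA basis would presumably be fitted on the training split only and then applied to the test split, to avoid leaking test information.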

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments, which help strengthen the rigor of our evaluation of Earth embedding complementarity. We address each major comment below and will incorporate revisions to address the concerns raised.

Point-by-point responses
  1. Referee: [Experimental evaluation] Evaluation protocol (described in the abstract and experimental sections): the direct comparison of fused (concatenated) embeddings against single-model baselines lacks dimension-matched controls such as PCA-reduced fusions or random-feature baselines of equivalent dimensionality. This leaves open the possibility that reported gains in 4/6 tasks arise from increased input capacity rather than genuine complementarity, which is load-bearing for the central claim.

    Authors: We agree that dimension-matched controls are essential to isolate genuine complementarity from the effects of increased feature dimensionality. Although our primary experiments demonstrate the value of direct fusion, the revised manuscript will include additional baselines: (1) PCA-reduced versions of the concatenated embeddings projected to match the dimensionality of the strongest single-model baseline, and (2) random projection baselines of equivalent dimensionality. These controls will be reported for all tasks to confirm that observed gains in four of six tasks arise from complementary information rather than capacity alone. revision: yes

  2. Referee: [Results] Results reporting: the manuscript states clear outperformance in four out of six tasks but provides no error bars, statistical significance tests, or details on data splits and fusion implementation. Without these, the reliability of the 4/6 count and the complementarity index cannot be verified.

    Authors: We appreciate this feedback on reproducibility. The revised manuscript will expand the experimental section to include: error bars (standard deviation across repeated runs or cross-validation folds), statistical significance tests (e.g., paired t-tests or Wilcoxon tests) comparing fused versus best single-model performance, and detailed descriptions of data splits, preprocessing steps prior to fusion, the exact concatenation procedure, and implementation details (including any normalization or handling of missing values). These additions will allow independent verification of the 4/6 outperformance count and the complementarity index. revision: yes
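
As one illustration of a paired comparison between fused and best single-model scores across folds, the sketch below uses an exact sign-flip permutation test. This is a swapped-in technique chosen because it needs only the standard library; it is not the paired t-test or Wilcoxon test named in the response. The function name and the per-fold scores are made up for illustration.

```python
from itertools import product


def paired_sign_flip_pvalue(fused_scores, single_scores):
    """Exact two-sided sign-flip permutation test on paired differences."""
    diffs = [f - s for f, s in zip(fused_scores, single_scores)]
    observed = abs(sum(diffs))
    total = 0
    extreme = 0
    # Under the null, each paired difference is equally likely to have either sign.
    for signs in product((1, -1), repeat=len(diffs)):
        total += 1
        flipped = abs(sum(sign * d for sign, d in zip(signs, diffs)))
        if flipped >= observed - 1e-12:
            extreme += 1
    return extreme / total


# Made-up per-fold scores for illustration only.
fused = [0.81, 0.79, 0.83, 0.80, 0.82]
single = [0.78, 0.77, 0.80, 0.79, 0.80]
p_value = paired_sign_flip_pvalue(fused, single)
```

With five folds there are only 32 sign patterns, so the smallest attainable two-sided p-value is 2/32 = 0.0625, which is what this example returns even though fusion wins on every fold. This suggests that more folds or repeated runs would be needed before a count such as 4/6 could be backed by a conventional significance threshold.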

Circularity Check

0 steps flagged

No circularity: direct empirical comparisons of fused vs. single embeddings

Full rationale

The paper's derivation consists of an empirical protocol that measures downstream task performance for single embeddings, all pairwise fusions, and the joint fusion across six tasks, then reports the gain of the fused case over the best single baseline. The complementarity index is defined directly as this observed performance difference and is computed on independent task evaluations; no equations reduce the reported gains to fitted parameters defined from the same data, no self-citations serve as load-bearing uniqueness theorems, and no ansatz or renaming of prior results is invoked to derive the central claim. The evaluation is self-contained and replicable from the stated experimental setup.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work rests on the domain assumption that the chosen six tasks adequately sample the space of Earth-observation applications and that the four selected models are representative of current Earth embedding approaches.

axioms (1)
  • domain assumption: The six downstream tasks are representative of real-world uses of Earth embeddings.
    The complementarity claim depends on these tasks capturing meaningful differences between models.
invented entities (1)
  • Embedding complementarity index (no independent evidence)
    purpose: Quantify performance gain from fusing embeddings over the best single model.
    New metric introduced to operationalize the complementarity assessment.


