
arxiv: 2603.10658 · v2 · submitted 2026-03-11 · 💻 cs.CV

How to Embed Matters: Evaluation of EO Embedding Design Choices

Pith reviewed 2026-05-15 13:38 UTC · model grok-4.3

classification 💻 cs.CV
keywords earth observation · geospatial foundation models · embeddings · feature extraction · transformer · ResNet · self-supervised learning · benchmark

The pith

Embedding design choices in geospatial foundation models shape performance on earth observation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically tests how decisions about extracting, aggregating, and combining representations from Geospatial Foundation Models affect results on downstream earth observation tasks. It examines backbone architecture, pretraining objectives, feature depth, spatial pooling, and multi-embedding fusion using a dedicated benchmark. The analysis shows that certain combinations produce compact representations over 500 times smaller than raw imagery while preserving utility across tasks. These patterns highlight trade-offs that matter for building scalable, reusable embedding pipelines in large-scale EO workflows.

Core claim

Experiments across multiple models establish that transformer backbones paired with mean pooling deliver strong default embeddings, intermediate ResNet layers can surpass final-layer features, self-supervised pretraining objectives display task-dependent strengths, and fusing embeddings from different objectives improves robustness on earth observation benchmarks.
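
To make the "transformer backbone with mean pooling" default concrete, here is a minimal sketch of the two ViT aggregation options that the figure captions compare (mean pooling versus the CLS token). It is an illustration under stated assumptions, not the authors' code: the token layout (CLS at index 0, patch tokens after it), the exclusion of the CLS token from the mean, and the helper names `mean_pool` and `vit_embeddings` are all hypothetical.

```python
from typing import Dict, List

Vector = List[float]


def mean_pool(tokens: List[Vector]) -> Vector:
    """Average each feature dimension over all tokens."""
    n = len(tokens)
    dim = len(tokens[0])
    return [sum(tok[d] for tok in tokens) / n for d in range(dim)]


def vit_embeddings(sequence: List[Vector]) -> Dict[str, Vector]:
    """Split a ViT output sequence into a CLS and a mean-pooled embedding.

    Assumption: index 0 holds the CLS token, the rest are patch tokens,
    and the mean is taken over patch tokens only.
    """
    cls_token, patch_tokens = sequence[0], sequence[1:]
    return {"cls": cls_token, "mean": mean_pool(patch_tokens)}


# Toy sequence: 1 CLS token + 4 patch tokens, embedding dimension 3.
sequence = [
    [9.0, 9.0, 9.0],  # CLS token
    [1.0, 0.0, 2.0],
    [3.0, 0.0, 2.0],
    [1.0, 4.0, 2.0],
    [3.0, 4.0, 2.0],
]
emb = vit_embeddings(sequence)
print(emb["mean"])  # [2.0, 2.0, 2.0]
```

Either output is a fixed-size vector whose length equals the model's embedding dimension, regardless of how many patches the input image produced.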

What carries the argument

Comparative evaluation of embedding strategies covering backbone type, representation depth, spatial aggregation method, and objective combination.

Load-bearing premise

The performance patterns seen on the benchmark dataset generalize to other earth observation data, sensors, and tasks.

What would settle it

A new dataset or task where mean pooling on transformers consistently underperforms alternatives or where combining embeddings fails to increase robustness would disprove the reported trends.

Figures

Figures reproduced from arXiv: 2603.10658 by Arne Ewald, Isabelle Wittmann, Johannes Jakubik, Luis Gilch, Maximilian Nitsche, Thomas Brunschwiler.

Figure 1: Per-task embedding performance across design choices. Distribution of regression performance across GeoFM backbones, self-supervised pretraining strategies, spatial aggregation methods, intermediate layers, and representation combinations. Performance is measured using mean R² (left), reflecting predictive accuracy, and the NeuCo Quality Score (right), which accounts for variability to reflect robustness.…
Figure 2: Per-task Q-score comparison of ResNet-50 (left) and ViT-Small (right) FMs. We use final-layer embeddings with mean pooling; negative scores are clipped to zero. ResNet models score high on semantic/land-cover tasks but show little performance elsewhere. ViT models are more consistent across tasks and achieve meaningful performance beyond land cover: TerraMind is the most consistent overall, DINO is strong …
Figure 3: Per-task Q-score comparison of spatial aggregation method for ResNet-50 (left) and ViT-Small (right). We use final-layer embeddings with mean, min, or max pooling (or the CLS token for ViT) and average scores across models; negative scores are clipped to zero. For ResNet, mean pooling performs best across tasks, with max pooling outperforming min pooling. For ViT, mean pooling again performs best, with CLS…
Figure 4: Per-task and overall ∆R² (left) and ∆Q-score (right) for embedding concatenation. Top: Intra-method concatenation (Mean + CLS within the same ViT-Small SSL4EO model). Bottom: Inter-method concatenation (Mean + Mean across different SSL objectives). For each task, the baseline is the stronger individual embedding, and we report ∆ = score_concat − score_baseline (zero indicates no change). We additionally rep…
Figure 5: Layer-wise task-averaged performance (R², left; Q-score, right). Top: ViT-Small; bottom: ResNet-50 (SSL4EO). Representations are extracted from each layer (12 transformer blocks; 5 ResNet stages with output dimensions 64, 256, 512, 1024, 2048); negative task values are clipped before averaging. ViT performance increases in early layers and then saturates, whereas ResNet shows an inverted-U pattern, peak…
Figure 6: Per-task R² comparison of ResNet-50 (left) and ViT-Small (right) FMs. Final-layer embeddings with mean pooling are used. In contrast to the main paper's Q-score visualization, this plot reports raw predictive performance (R²) per task. The overall ranking trends remain consistent: ResNet models perform strongly on semantic/land-cover tasks but show limited transfer beyond them, while ViT models are more…
Figure 7: Per-task R² comparison of spatial aggregation methods for ResNet-50 (left) and ViT-Small (right). Final-layer embeddings are evaluated using mean, min, or max pooling (and the CLS token for ViT), with scores averaged across models. This R² view confirms the Q-score trends reported in the main paper: mean pooling consistently yields the strongest performance across tasks and backbones. For ResNet, max poo…
Figure 8: Per-task R² radar plots for embedding concatenation experiments. We report results for all tested combinations, comparing the two individual baselines with their concatenated representation. The plots illustrate that concatenation typically preserves the stronger baseline and yields modest, task-dependent improvements. In particular, combinations such as SoftCon+DINO most consistently show positive deviat…
Figure 9: Layer-wise per-task downstream performance (R²) for ViT-Small models pretrained on SSL4EO. Results are shown separately per task across layer depth. Semantic and land-cover targets exhibit increasing and saturating trends toward deeper layers, consistent with the averaged analysis in the main paper. Other tasks show early saturation or slight degradation at greater depth.
Figure 10: Layer-wise per-task downstream performance (Q-score) for ViT-Small models pretrained on SSL4EO. The robustness trends largely mirror the R² behavior, confirming that depth-dependent effects are consistent across predictive accuracy and stability metrics.
Figure 11: Layer-wise per-task downstream performance (R²) for ResNet-50 models pretrained on SSL4EO. Semantic and land-cover tasks show increasing and saturating trends similar to ViT models. In contrast, several other tasks exhibit a pronounced drop at the final layer; intermediate layers frequently remain competitive with ViT final-layer embeddings.
Figure 12: Layer-wise per-task downstream performance (Q-score) for ResNet-50 models pretrained on SSL4EO. The robustness metric reinforces the R² trends, highlighting stronger depth sensitivity in ResNet compared to ViT backbones.
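
The Figure 4 caption defines the concatenation gain per task as ∆ = score_concat − score_baseline, where the baseline is the stronger of the two individual embeddings, and several captions note that negative scores are clipped to zero before averaging. The sketch below works through that bookkeeping. The task names and all score values are made up for illustration, and the function names are hypothetical rather than taken from the paper or NeuCo-Bench.

```python
from typing import Dict

Scores = Dict[str, float]


def delta_vs_best_baseline(score_a: Scores, score_b: Scores,
                           score_concat: Scores) -> Scores:
    """Per-task gain of the concatenated embedding over the stronger
    of the two individual embeddings (zero means no change)."""
    deltas = {}
    for task, concat_score in score_concat.items():
        baseline = max(score_a[task], score_b[task])
        deltas[task] = concat_score - baseline
    return deltas


def clipped_task_average(scores: Scores) -> float:
    """Clip negative task scores to zero, then average across tasks."""
    clipped = [max(0.0, s) for s in scores.values()]
    return sum(clipped) / len(clipped)


# Hypothetical scores for two individual embeddings and their concatenation.
score_a = {"crops": 0.5, "clouds": 0.25}
score_b = {"crops": 0.25, "clouds": 0.75}
score_concat = {"crops": 0.75, "clouds": 0.5}

deltas = delta_vs_best_baseline(score_a, score_b, score_concat)
print(deltas)  # {'crops': 0.25, 'clouds': -0.25}

# A negative task score contributes zero to the clipped average.
print(clipped_task_average({"t1": -0.5, "t2": 0.5, "t3": 1.0}))  # 0.5
```

Comparing against the per-task maximum rather than a fixed model makes the baseline deliberately hard to beat: a positive ∆ means the concatenation improved on the best single embedding for that task.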
Original abstract

Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
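
The "more than 500x smaller" figure is a size ratio between the raw input and the fixed-size embedding. The arithmetic can be sketched as follows. The input shape, embedding dimension, and bytes per value used here are hypothetical placeholders chosen only to show the calculation; they are not the values reported in the paper.

```python
from typing import Sequence


def compression_ratio(raw_shape: Sequence[int], embedding_dim: int,
                      raw_bytes_per_value: int = 4,
                      emb_bytes_per_value: int = 4) -> float:
    """Ratio of raw input size to embedding size, both in bytes."""
    raw_values = 1
    for size in raw_shape:
        raw_values *= size
    raw_bytes = raw_values * raw_bytes_per_value
    emb_bytes = embedding_dim * emb_bytes_per_value
    return raw_bytes / emb_bytes


# Hypothetical example: a 12-band, 224 x 224 pixel patch reduced to a
# 1024-dimensional embedding, with the same numeric precision on both sides.
ratio = compression_ratio(raw_shape=(12, 224, 224), embedding_dim=1024)
print(ratio)  # 588.0
```

Because the embedding size is fixed, the ratio grows with every additional band, timestamp, or pixel in the raw input, and shrinks if embeddings are stored at higher precision than the imagery.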

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates embedding design choices for Geospatial Foundation Models (GeoFMs) in Earth Observation (EO) workflows using the NeuCo-Bench benchmark. It systematically analyzes the effects of backbone architecture, pretraining strategy, representation depth, spatial aggregation, and embedding combination on downstream task performance. The central findings are that transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives have task-specific strengths, and combining embeddings from different objectives improves robustness, all while achieving over 500x compression compared to raw data.

Significance. If the reported trends hold beyond the evaluated benchmark, this work offers actionable insights for scalable EO pipelines that leverage GeoFMs as fixed feature extractors. The emphasis on compact, reusable embeddings addresses key challenges in handling petabyte-scale multispectral imagery, potentially guiding practitioners toward more efficient and robust workflows.

major comments (2)
  1. Results section: The claims of 'consistent trends' (transformer+mean pooling as default, intermediate ResNet layers outperforming final layers, task-specific self-supervised strengths, and robustness from combinations) are presented without error bars, statistical significance tests, or details on dataset splits and controls for confounding factors, leaving the empirical support for these load-bearing findings only moderately substantiated.
  2. Discussion section: The positioning of NeuCo-Bench trends as relevant to general GeoFM workflows (including 500x compression benefits) is not supported by any cross-dataset or cross-sensor validation; if NeuCo-Bench shares unaccounted biases in sensor characteristics or label distributions, the reported design preferences may not generalize and undermine the claimed utility.
minor comments (2)
  1. Abstract: Expand 'EO' and 'GeoFM' on first use and provide a brief parenthetical definition of NeuCo-Bench for readers unfamiliar with the benchmark.
  2. Methods: Include explicit pseudocode or equations for the spatial aggregation (e.g., mean pooling) and embedding combination procedures to aid reproducibility.
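
As an illustration of the kind of pseudocode minor comment 2 asks for, the sketch below applies mean, min, and max pooling over the spatial dimensions of a CNN-style feature map and concatenates two pooled embeddings. It is a generic rendering of these standard operations, not the authors' implementation; the channels × height × width layout and the names `spatial_pool` and `concat` are assumptions.

```python
from typing import Callable, Iterable, List

FeatureMap = List[List[List[float]]]  # channels x height x width
Vector = List[float]


def mean(values: Iterable[float]) -> float:
    vals = list(values)
    return sum(vals) / len(vals)


def spatial_pool(fmap: FeatureMap,
                 reduce: Callable[[Iterable[float]], float]) -> Vector:
    """Collapse each channel's H x W grid to one value using `reduce`."""
    return [reduce(v for row in channel for v in row) for channel in fmap]


def concat(*embeddings: Vector) -> Vector:
    """Combine embeddings by concatenating them along the feature axis."""
    fused: Vector = []
    for emb in embeddings:
        fused.extend(emb)
    return fused


# Toy feature maps from two hypothetical models (2 channels and 1 channel).
fmap_a = [
    [[1.0, 2.0], [3.0, 6.0]],
    [[0.0, 4.0], [4.0, 0.0]],
]
fmap_b = [[[2.0, 2.0], [2.0, 2.0]]]

mean_a = spatial_pool(fmap_a, mean)  # [3.0, 2.0]
max_a = spatial_pool(fmap_a, max)    # [6.0, 4.0]
min_a = spatial_pool(fmap_a, min)    # [1.0, 0.0]
fused = concat(mean_a, spatial_pool(fmap_b, mean))
print(fused)  # [3.0, 2.0, 2.0]
```

Concatenation adds the dimensions of the inputs, so fusing embeddings trades some compactness for the robustness gains the paper reports.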

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment below, indicating the revisions we plan to make to strengthen the manuscript.

Point-by-point responses
  1. Referee: Results section: The claims of 'consistent trends' (transformer+mean pooling as default, intermediate ResNet layers outperforming final layers, task-specific self-supervised strengths, and robustness from combinations) are presented without error bars, statistical significance tests, or details on dataset splits and controls for confounding factors, leaving the empirical support for these load-bearing findings only moderately substantiated.

    Authors: We agree that the empirical support can be strengthened by including error bars, statistical tests, and additional methodological details. In the revised version, we will report error bars based on multiple random seeds or cross-validation folds for the key performance metrics. We will also include statistical significance tests (e.g., paired t-tests) for the main comparisons supporting our 'consistent trends' claims. Furthermore, we will expand the experimental setup section to provide complete details on dataset splits, preprocessing, and any controls for confounding factors. These additions will make the results more robustly substantiated.
    revision: yes

  2. Referee: Discussion section: The positioning of NeuCo-Bench trends as relevant to general GeoFM workflows (including 500x compression benefits) is not supported by any cross-dataset or cross-sensor validation; if NeuCo-Bench shares unaccounted biases in sensor characteristics or label distributions, the reported design preferences may not generalize and undermine the claimed utility.

    Authors: We acknowledge this as a valid limitation of the current study. While NeuCo-Bench includes a variety of EO tasks and sensor modalities to promote diversity, we agree that broader cross-dataset and cross-sensor validation would enhance generalizability claims. In the revision, we will temper the language in the discussion to specify that the observed trends hold within the NeuCo-Bench benchmark and discuss potential biases related to sensor characteristics and label distributions. We will retain the 500x compression claim as it is a direct comparison of embedding size to raw data size, independent of the specific benchmark, but clarify its applicability to embedding-based workflows in general. We will also add a section on limitations and future work to address generalization.
    revision: partial
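
The statistical additions promised in response 1 (error bars over seeds and paired t-tests) could take roughly the following form. This is a minimal sketch using only the Python standard library; the per-seed scores are invented for illustration, and it computes the t statistic only, leaving the p-value lookup (with n − 1 degrees of freedom) to a statistics package.

```python
import math
import statistics
from typing import List, Tuple


def mean_and_std(scores: List[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation, e.g. for error bars over seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(a: List[float], b: List[float]) -> float:
    """t statistic of a paired t-test on matched scores (same seed or fold)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally long samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))


# Hypothetical per-seed scores for two embedding configurations.
config_a = [0.5, 0.75, 1.0, 0.75]
config_b = [0.25, 0.25, 0.75, 0.75]

mean_a, std_a = mean_and_std(config_a)
t_stat = paired_t_statistic(config_a, config_b)
print(f"config A: {mean_a:.3f} +/- {std_a:.3f}, paired t = {t_stat:.3f}")
```

Pairing by seed or fold matters here: it removes variation shared by both configurations, so the test measures the difference between design choices rather than the noise between runs.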

Circularity Check

0 steps flagged

No significant circularity: pure empirical benchmarking study

Full rationale

The paper conducts a systematic empirical evaluation of EO embedding design choices (backbone architecture, pretraining strategy, representation depth, spatial aggregation, and combination) by measuring downstream task performance on the external NeuCo-Bench benchmark. No mathematical derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations are present. All reported trends (e.g., transformer+mean pooling as strong default, intermediate ResNet layers outperforming final layers) are direct observations from benchmark metrics and do not reduce to the paper's own inputs by construction. The analysis is self-contained against the stated benchmark without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical evaluation paper that introduces no new mathematical objects or fitted parameters; relies on existing foundation models and the NeuCo-Bench benchmark.

axioms (1)
  • domain assumption NeuCo-Bench is representative of real-world EO tasks and datasets.
    All reported trends depend on this benchmark serving as a valid proxy.

pith-pipeline@v0.9.0 · 5520 in / 1132 out tokens · 45158 ms · 2026-05-15T13:38:37.183423+00:00 · methodology

