arxiv: 2603.10658 · v2 · submitted 2026-03-11 · 💻 cs.CV

How to Embed Matters: Evaluation of EO Embedding Design Choices

Luis Gilch, Isabelle Wittmann, Maximilian Nitsche, Johannes Jakubik, Arne Ewald, Thomas Brunschwiler

Pith reviewed 2026-05-15 13:38 UTC · model grok-4.3

classification 💻 cs.CV

keywords earth observation · geospatial foundation models · embeddings · feature extraction · transformer · resnet · self-supervised learning · benchmark


The pith

Embedding design choices in geospatial foundation models shape performance on earth observation tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper systematically tests how decisions about extracting, aggregating, and combining representations from Geospatial Foundation Models affect results on downstream earth observation tasks. It examines backbone architecture, pretraining objectives, feature depth, spatial pooling, and multi-embedding fusion using a dedicated benchmark. The analysis shows that certain combinations produce compact representations over 500 times smaller than raw imagery while preserving utility across tasks. These patterns highlight trade-offs that matter for building scalable, reusable embedding pipelines in large-scale EO workflows.

Core claim

Experiments across multiple models establish that transformer backbones paired with mean pooling deliver strong default embeddings, intermediate ResNet layers can surpass final-layer features, self-supervised pretraining objectives display task-dependent strengths, and fusing embeddings from different objectives improves robustness on earth observation benchmarks.

What carries the argument

Comparative evaluation of embedding strategies covering backbone type, representation depth, spatial aggregation method, and objective combination.

Load-bearing premise

The performance patterns seen on the benchmark dataset generalize to other earth observation data, sensors, and tasks.

What would settle it

A new dataset or task where mean pooling on transformers consistently underperforms alternatives or where combining embeddings fails to increase robustness would disprove the reported trends.

Figures

Figures reproduced from arXiv: 2603.10658 by Arne Ewald, Isabelle Wittmann, Johannes Jakubik, Luis Gilch, Maximilian Nitsche, Thomas Brunschwiler.

**Figure 1.** Per-task embedding performance across design choices. Distribution of regression performance across GeoFM backbones, self-supervised pretraining strategies, spatial aggregation methods, intermediate layers, and representation combinations. Performance is measured using mean R² (left), reflecting predictive accuracy, and the NeuCo Quality Score (right), which accounts for variability to reflect robustness.…

**Figure 2.** Per-task Q-score comparison of ResNet-50 (left) and ViT-Small (right) FMs. We use final-layer embeddings with mean pooling; negative scores are clipped to zero. ResNet models score high on semantic/land-cover tasks but show little performance elsewhere. ViT models are more consistent across tasks and achieve meaningful performance beyond land cover: TerraMind is the most consistent overall, DINO is strong …

**Figure 3.** Per-task Q-score comparison of spatial aggregation method for ResNet-50 (left) and ViT-Small (right). We use final-layer embeddings with mean, min, or max pooling (or the CLS token for ViT) and average scores across models; negative scores are clipped to zero. For ResNet, mean pooling performs best across tasks, with max pooling outperforming min pooling. For ViT, mean pooling again performs best, with CLS…

**Figure 4.** Per-task and overall ΔR² (left) and ΔQ-score (right) for embedding concatenation. Top: Intra-method concatenation (Mean + CLS within the same ViT-Small SSL4EO model). Bottom: Inter-method concatenation (Mean + Mean across different SSL objectives). For each task, the baseline is the stronger individual embedding, and we report Δ = score_concat − score_baseline (zero indicates no change). We additionally rep…

**Figure 5.** Layer-wise task-averaged performance (R², left; Q-score, right). Top: ViT-Small; bottom: ResNet-50 (SSL4EO). Representations are extracted from each layer (12 transformer blocks; 5 ResNet stages with output dimensions 64, 256, 512, 1024, 2048); negative task values are clipped before averaging. ViT performance increases in early layers and then saturates, whereas ResNet shows an inverted-U pattern, peak…

**Figure 6.** Per-task R² comparison of ResNet-50 (left) and ViT-Small (right) FMs. Final-layer embeddings with mean pooling are used. In contrast to the main paper’s Q-score visualization, this plot reports raw predictive performance (R²) per task. The overall ranking trends remain consistent: ResNet models perform strongly on semantic/land-cover tasks but show limited transfer beyond them, while ViT models are more…

**Figure 7.** Per-task R² comparison of spatial aggregation methods for ResNet-50 (left) and ViT-Small (right). Final-layer embeddings are evaluated using mean, min, or max pooling (and the CLS token for ViT), with scores averaged across models. This R² view confirms the Q-score trends reported in the main paper: mean pooling consistently yields the strongest performance across tasks and backbones. For ResNet, max poo…

**Figure 8.** Per-task R² radar plots for embedding concatenation experiments. We report results for all tested combinations, comparing the two individual baselines with their concatenated representation. The plots illustrate that concatenation typically preserves the stronger baseline and yields modest, task-dependent improvements. In particular, combinations such as SoftCon+DINO most consistently show positive deviat…

**Figure 9.** Layer-wise per-task downstream performance (R²) for ViT-Small models pretrained on SSL4EO. Results are shown separately per task across layer depth. Semantic and land-cover targets exhibit increasing and saturating trends toward deeper layers, consistent with the averaged analysis in the main paper. Other tasks show early saturation or slight degradation at greater depth.

**Figure 10.** Layer-wise per-task downstream performance (Q-score) for ViT-Small models pretrained on SSL4EO. The robustness trends largely mirror the R² behavior, confirming that depth-dependent effects are consistent across predictive accuracy and stability metrics.

**Figure 11.** Layer-wise per-task downstream performance (R²) for ResNet-50 models pretrained on SSL4EO. Semantic and land-cover tasks show increasing and saturating trends similar to ViT models. In contrast, several other tasks exhibit a pronounced drop at the final layer; intermediate layers frequently remain competitive with ViT final-layer embeddings.

**Figure 12.** Layer-wise per-task downstream performance (Q-score) for ResNet-50 models pretrained on SSL4EO. The robustness metric reinforces the R² trends, highlighting stronger depth sensitivity in ResNet compared to ViT backbones.
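The score conventions named in the captions (negative scores clipped to zero before averaging, and Δ measured against the stronger individual embedding) can be illustrated with a small Python sketch. The function names and toy numbers below are hypothetical and chosen for illustration only; the paper's own evaluation code may differ.

```python
def clip_negative(score):
    """Clip a negative task score to zero, as the captions describe."""
    return max(0.0, score)


def task_average(scores):
    """Average per-task scores after clipping negative values to zero."""
    clipped = [clip_negative(s) for s in scores]
    return sum(clipped) / len(clipped)


def concat_delta(score_a, score_b, score_concat):
    """Delta of a concatenated embedding against the stronger single embedding.

    Zero means concatenation changed nothing relative to the better baseline.
    """
    baseline = max(score_a, score_b)
    return score_concat - baseline


# Toy numbers (not taken from the paper).
avg = task_average([-0.5, 0.5, 1.0])      # the negative score counts as 0.0
delta = concat_delta(0.5, 0.75, 0.875)    # compared against 0.75, not 0.5
```

Measuring Δ against the stronger baseline is the stricter choice: a concatenation only registers a gain if it beats both of its inputs on that task.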

Original abstract

Earth observation (EO) missions produce petabytes of multispectral imagery, increasingly analyzed using large Geospatial Foundation Models (GeoFMs). Alongside end-to-end adaptation, workflows make growing use of intermediate representations as task-agnostic embeddings, enabling models to compute representations once and reuse them across downstream tasks. Consequently, when GeoFMs act as feature extractors, decisions about how representations are obtained, aggregated, and combined affect downstream performance and pipeline scalability. Understanding these trade-offs is essential for scalable embedding-based EO workflows, where compact embeddings can replace raw data while remaining broadly useful. We present a systematic analysis of embedding design in GeoFM-based EO workflows. Leveraging NeuCo-Bench, we study how backbone architecture, pretraining strategy, representation depth, spatial aggregation, and representation combination influence EO task performance. We demonstrate the usability of GeoFM embeddings by aggregating them into fixed-size representations more than 500x smaller than the raw input data. Across models, we find consistent trends: transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives exhibit task-specific strengths, and combining embeddings from different objectives often improves robustness.
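To make the "more than 500x smaller" figure concrete, the arithmetic can be sketched as a ratio of raw input bytes to embedding bytes. The input shape, data types, and embedding width below are illustrative assumptions, not the configuration reported in the paper; only the 384-dimensional width matches the standard ViT-Small hidden size.

```python
def compression_ratio(input_shape, input_bytes_per_value,
                      embedding_dim, embedding_bytes_per_value):
    """Ratio of raw input size to embedding size, both in bytes."""
    n_values = 1
    for dim in input_shape:
        n_values *= dim
    raw_bytes = n_values * input_bytes_per_value
    embedding_bytes = embedding_dim * embedding_bytes_per_value
    return raw_bytes / embedding_bytes


# Hypothetical example: a 12-band, 224x224 patch stored as 16-bit integers,
# reduced to a 384-dimensional float32 vector.
ratio = compression_ratio((12, 224, 224), 2, 384, 4)
print(f"compression ratio: {ratio:.0f}x")
```

Under these assumed shapes the ratio comes out at 784x. The actual figure depends on how many bands, timestamps, and pixels the raw sample contains and on the precision used to store the embedding.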

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This ablation gives some practical defaults for GeoFM embeddings in EO, but the trends rest on one benchmark, so their reach is still unclear.


The main thing here is a systematic check of how backbone choice, layer depth, pooling, and objective mixing affect downstream EO performance when using GeoFMs as fixed feature extractors. They report a few steady patterns across models: transformers with mean pooling as a reliable default, intermediate ResNet layers sometimes beating the final layer, task-specific strengths from different self-supervised pretraining, and robustness gains when combining embeddings from multiple objectives. They also show the embeddings can be reduced to fixed-size vectors more than 500 times smaller than the raw imagery while remaining usable. That compression angle is the part most likely to matter for people running large-scale pipelines. The work stays empirical and benchmark-driven with no circular definitions or invented quantities, which keeps the claims grounded in external task performance. The experiments target a clear gap: prior work mostly looked at single models or end-to-end fine-tuning, not this combination of design axes for reusable embeddings.

The soft spot is scope. All the reported orderings come from NeuCo-Bench, so it is unknown whether the same preferences survive changes in sensor, geography, or label distribution. The abstract gives no error bars, significance tests, or split details, which makes it difficult to judge how stable the trends actually are. If the benchmark carries correlated biases, the advice could be narrower than it appears.

This is the sort of paper that helps engineers who already work with GeoFMs and need quick guidance on feature extraction rather than readers looking for new theory. It deserves peer review because the question is practical and timely for the subfield; referees can request extra datasets or statistical controls without altering the core contribution.

Referee Report

2 major / 2 minor

Summary. The manuscript evaluates embedding design choices for Geospatial Foundation Models (GeoFMs) in Earth Observation (EO) workflows using the NeuCo-Bench benchmark. It systematically analyzes the effects of backbone architecture, pretraining strategy, representation depth, spatial aggregation, and embedding combination on downstream task performance. The central findings are that transformer backbones with mean pooling provide strong default embeddings, intermediate ResNet layers can outperform final layers, self-supervised objectives have task-specific strengths, and combining embeddings from different objectives improves robustness, all while achieving over 500x compression compared to raw data.

Significance. If the reported trends hold beyond the evaluated benchmark, this work offers actionable insights for scalable EO pipelines that leverage GeoFMs as fixed feature extractors. The emphasis on compact, reusable embeddings addresses key challenges in handling petabyte-scale multispectral imagery, potentially guiding practitioners toward more efficient and robust workflows.

major comments (2)

Results section: The claims of 'consistent trends' (transformer+mean pooling as default, intermediate ResNet layers outperforming final layers, task-specific self-supervised strengths, and robustness from combinations) are presented without error bars, statistical significance tests, or details on dataset splits and controls for confounding factors, leaving the empirical support for these load-bearing findings only moderately substantiated.
Discussion section: The positioning of NeuCo-Bench trends as relevant to general GeoFM workflows (including 500x compression benefits) is not supported by any cross-dataset or cross-sensor validation; if NeuCo-Bench shares unaccounted biases in sensor characteristics or label distributions, the reported design preferences may not generalize and undermine the claimed utility.

minor comments (2)

Abstract: Expand 'EO' and 'GeoFM' on first use and provide a brief parenthetical definition of NeuCo-Bench for readers unfamiliar with the benchmark.
Methods: Include explicit pseudocode or equations for the spatial aggregation (e.g., mean pooling) and embedding combination procedures to aid reproducibility.
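As an illustration of the kind of pseudocode the second minor comment asks for, the following dependency-free Python sketch shows mean, min, and max pooling over spatial tokens, CLS-token selection, and fusion by concatenation. The token layout (a list of per-position feature vectors, with the CLS token at index 0) and all function names are assumptions made for this sketch; the manuscript's actual procedures are not specified in the material reviewed here.

```python
from statistics import fmean


def pool_tokens(tokens, method="mean"):
    """Collapse per-position feature vectors into one fixed-size vector.

    tokens: list of equal-length lists, one per spatial position
            (hypothetical layout: positions x features).
    """
    if not tokens:
        raise ValueError("need at least one token")
    reducers = {"mean": fmean, "min": min, "max": max}
    if method not in reducers:
        raise ValueError(f"unknown pooling method: {method}")
    reducer = reducers[method]
    columns = zip(*tokens)  # one tuple per feature dimension
    return [float(reducer(col)) for col in columns]


def cls_embedding(tokens_with_cls):
    """Return the first token, assuming the CLS token sits at index 0."""
    return [float(v) for v in tokens_with_cls[0]]


def concatenate(*embeddings):
    """Fuse several embeddings by concatenating along the feature axis."""
    fused = []
    for emb in embeddings:
        fused.extend(emb)
    return fused


# Toy example: 3 spatial positions, 2 feature dimensions.
patch_tokens = [[1.0, 4.0], [2.0, 0.0], [3.0, 2.0]]
mean_emb = pool_tokens(patch_tokens, "mean")
max_emb = pool_tokens(patch_tokens, "max")
fused = concatenate(mean_emb, max_emb)  # length doubles: 2 + 2 = 4
```

Note that concatenation grows the embedding linearly with the number of fused sources, so any robustness gain has to be weighed against the larger storage footprint.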

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment below, indicating the revisions we plan to make to strengthen the manuscript.


Referee: Results section: The claims of 'consistent trends' (transformer+mean pooling as default, intermediate ResNet layers outperforming final layers, task-specific self-supervised strengths, and robustness from combinations) are presented without error bars, statistical significance tests, or details on dataset splits and controls for confounding factors, leaving the empirical support for these load-bearing findings only moderately substantiated.

Authors: We agree that the empirical support can be strengthened by including error bars, statistical tests, and additional methodological details. In the revised version, we will report error bars based on multiple random seeds or cross-validation folds for the key performance metrics. We will also include statistical significance tests (e.g., paired t-tests) for the main comparisons supporting our 'consistent trends' claims. Furthermore, we will expand the experimental setup section to provide complete details on dataset splits, preprocessing, and any controls for confounding factors. These additions will make the results more robustly substantiated. revision: yes
Referee: Discussion section: The positioning of NeuCo-Bench trends as relevant to general GeoFM workflows (including 500x compression benefits) is not supported by any cross-dataset or cross-sensor validation; if NeuCo-Bench shares unaccounted biases in sensor characteristics or label distributions, the reported design preferences may not generalize and undermine the claimed utility.

Authors: We acknowledge this as a valid limitation of the current study. While NeuCo-Bench includes a variety of EO tasks and sensor modalities to promote diversity, we agree that broader cross-dataset and cross-sensor validation would enhance generalizability claims. In the revision, we will temper the language in the discussion to specify that the observed trends hold within the NeuCo-Bench benchmark and discuss potential biases related to sensor characteristics and label distributions. We will retain the 500x compression claim as it is a direct comparison of embedding size to raw data size, independent of the specific benchmark, but clarify its applicability to embedding-based workflows in general. We will also add a section on limitations and future work to address generalization. revision: partial
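The error bars and paired tests promised in the first response could be computed along the lines of the sketch below. It assumes per-seed (or per-fold) scores for two embedding variants evaluated on the same splits; the function names and toy scores are hypothetical. Only the t statistic is computed here, because the Python standard library has no t-distribution CDF; in practice a routine such as `scipy.stats.ttest_rel` would also return the p-value.

```python
from math import sqrt
from statistics import mean, stdev


def mean_and_std(scores):
    """Mean and sample standard deviation, e.g. for plotting error bars."""
    return mean(scores), stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """Paired t statistic for two variants scored on the same seeds or folds.

    Compare the result against a t distribution with n - 1 degrees of freedom.
    """
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equally long score lists with n >= 2")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = stdev(diffs)
    if spread == 0:
        raise ValueError("all paired differences are identical")
    return mean(diffs) / (spread / sqrt(len(diffs)))


# Toy per-seed scores (not taken from the paper).
variant_a = [0.5, 0.75, 1.0, 0.75]
variant_b = [0.25, 0.25, 0.5, 0.5]
t_stat = paired_t_statistic(variant_a, variant_b)
```

A paired test suits this setting because both variants share the same seeds and splits, so seed-to-seed variance cancels in the differences.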

Circularity Check

0 steps flagged

No significant circularity: pure empirical benchmarking study


The paper conducts a systematic empirical evaluation of EO embedding design choices (backbone architecture, pretraining strategy, representation depth, spatial aggregation, and combination) by measuring downstream task performance on the external NeuCo-Bench benchmark. No mathematical derivations, fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations are present. All reported trends (e.g., transformer+mean pooling as strong default, intermediate ResNet layers outperforming final layers) are direct observations from benchmark metrics and do not reduce to the paper's own inputs by construction. The analysis is self-contained against the stated benchmark without circular reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Empirical evaluation paper that introduces no new mathematical objects or fitted parameters; relies on existing foundation models and the NeuCo-Bench benchmark.

axioms (1)

Domain assumption: NeuCo-Bench is representative of real-world EO tasks and datasets.
All reported trends depend on this benchmark serving as a valid proxy.

pith-pipeline@v0.9.0 · 5520 in / 1132 out tokens · 45158 ms · 2026-05-15T13:38:37.183423+00:00 · methodology

