arxiv: 2601.12964 · v2 · submitted 2026-01-19 · 💻 cs.CV

Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation

John Waithaka , Gustave Bwirayesu , Moise Busogi This is my paper

Pith reviewed 2026-05-16 13:13 UTC · model grok-4.3

classification 💻 cs.CV

keywords self-supervised learningsatellite imagerysemantic segmentationcross-scale pretrainingspatial affinityhigh-resolutionmid-resolutionremote sensing

0 comments

The pith

A spatial affinity component added to self-supervised frameworks uses high-resolution imagery to create stronger representations of mid-resolution satellite images for semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a spatial affinity component that can be plugged into existing self-supervised learning methods to let high-resolution satellite images guide the learning of better features from mid-resolution images. This matters because mid-resolution data is far more available than high-resolution data, yet the added component produces representations that improve downstream segmentation accuracy beyond what either resolution achieves alone. The approach was tested on two separate self-supervised frameworks and consistently outperformed single-scale pretraining baselines. By bridging scales during pretraining, the method aims to make limited high-resolution data useful for the more common mid-resolution tasks without retraining entire pipelines from scratch.

Core claim

The central claim is that a spatial affinity component, when inserted into existing self-supervised learning frameworks, enables high-resolution imagery to improve the quality of representations learned for mid-resolution imagery, resulting in better semantic segmentation performance on mid-resolution tasks than models pretrained on high-resolution data alone, mid-resolution data alone, or without the component.

What carries the argument

The spatial affinity component, which computes cross-scale relationships between high-resolution and mid-resolution image patches to transfer detailed spatial information into the learning process for lower-resolution data.

If this is right

Semantic segmentation models for mid-resolution satellite imagery achieve higher accuracy after cross-scale pretraining than after single-resolution pretraining.
The component integrates with multiple existing self-supervised learning frameworks without requiring changes to their core objectives.
High-resolution datasets become usable for improving representations even when the target downstream task operates only on mid-resolution data.
Pretraining no longer needs to choose between high-resolution detail and mid-resolution volume; both can contribute simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same cross-scale mechanism could be tested on other multi-resolution remote-sensing tasks such as object detection or change detection.
If the component generalizes across sensors, it might allow pretraining on mixed archives from different satellites without explicit resolution alignment.
Extending the affinity computation to temporal sequences could link high-resolution snapshots with frequent mid-resolution time series for dynamic monitoring applications.

Load-bearing premise

The spatial affinity component transfers useful information from high-resolution to mid-resolution images without introducing scale-specific biases that would reduce performance on mid-resolution tasks.

What would settle it

If models pretrained with the spatial affinity component show equal or lower segmentation accuracy on held-out mid-resolution test sets compared to identical models pretrained only on mid-resolution images, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2601.12964 by Gustave Bwirayesu, John Waithaka, Moise Busogi.

**Figure 1.** Figure 1: Spatial affinity component samples patches from the high- and midresolution inputs. It uses the SSL framework’s encoder to encode the lower resolution image and an added high-resolution teacher to encode the highresolution input. The resulting representations from either encoder are used to compute the gram loss. samples a random set of non-contiguous patches, ∼ 10% of an image’s patches, and the rest be… view at source ↗

**Figure 2.** Figure 2: Patch sampling strategies used by the spatial affinity component and [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Unsupervised cluster maps of the patch representations of a Sentinel 2 image with k = 3. Zoom in to see which model’s representations are able to identify the distinct features circled in red. V. ABLATIONS A. High-resolution Representation Downsampling Due to the size difference stated in equation 1 between the mid- and high-resolution images, each patch in the midresolution image corresponds to s 2 patch… view at source ↗

read the original abstract

Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The spatial affinity idea for cross-scale SSL in satellite images is worth a look but needs controls to prove the component is what drives the gains.

read the letter

The paper's core claim is that adding a spatial affinity component to standard self-supervised learning setups lets you pull useful signals from scarce high-resolution satellite images to improve representations for the more common mid-resolution ones, leading to better downstream segmentation. That's the punchline. What stands out as new is this specific module for cross-scale affinity in the pretraining phase. It seems designed to be plug-and-play with existing frameworks like those based on contrastive or reconstruction losses. The authors test it on two such frameworks and report gains over training on HR or MR data separately. This addresses a real practical problem in remote sensing, where high-res data is expensive and limited, but mid-res is plentiful and used for many applications like land cover mapping. The approach looks reasonable on paper: by focusing on spatial relationships across scales, it aims to transfer fine details without just naively mixing the data. If the implementation is clean and the affinity is computed in a way that avoids scale biases, it could be a useful addition to the toolkit for pretraining on satellite data. That said, the abstract gives no numbers at all—no mIoU improvements, no dataset sizes, no ablation on the component itself. The stress-test note is fair: without a control that just exposes the model to both scales in a simpler way, like joint batches or multi-scale views, it's unclear if the affinity is doing the heavy lifting or if it's the extra data volume. The full paper needs to show those comparisons and some error analysis to hold up. Also, details on how the spatial affinity is calculated would help—is it based on feature similarity, attention, or something else? This is for researchers working on self-supervised methods in earth observation or satellite imagery analysis. A reader interested in practical improvements for low-res tasks would get value if the experiments check out. It might not change the field broadly but could be a handy technique for specific datasets. I'd recommend sending it for peer review. The idea is solid enough and the problem is relevant, even if the current writeup is light on evidence. With proper validation, it could be worth citing in related work on multi-scale SSL.

Referee Report

2 major / 0 minor

Summary. The paper proposes a spatial affinity component that integrates into existing self-supervised learning (SSL) frameworks to leverage high-resolution (HR) satellite imagery for learning improved representations of mid-resolution (MR) imagery, with the goal of enhancing downstream semantic segmentation performance on MR tasks. It reports testing this component on two SSL frameworks and claims outperformance relative to pretraining on HR or MR imagery alone.

Significance. If the central claim holds after proper controls, the approach could meaningfully advance SSL pretraining in remote sensing by providing a modular way to incorporate scarcer HR data into abundant MR pipelines, improving segmentation accuracy on low-resolution tasks without additional labels.

major comments (2)

Abstract: the claim that the spatial affinity component 'outperforms models pretrained on HR or MR images alone' is presented without any quantitative metrics, ablation results, dataset sizes, or error analysis, leaving the central empirical claim unsupported in the provided text.
Experiments (implied by abstract description): comparisons are limited to HR-only and MR-only pretraining baselines. This does not isolate the contribution of the spatial affinity component from the simple effect of exposing the SSL framework to a larger combined HR+MR data volume; controls such as joint batch training, multi-scale augmentations, or sequential HR-then-MR pretraining are required to substantiate the mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of clarity and experimental rigor in our work on cross-scale pretraining for satellite imagery. We address each major comment point by point below, providing clarifications based on the manuscript's content and indicating planned revisions.

read point-by-point responses

Referee: Abstract: the claim that the spatial affinity component 'outperforms models pretrained on HR or MR images alone' is presented without any quantitative metrics, ablation results, dataset sizes, or error analysis, leaving the central empirical claim unsupported in the provided text.

Authors: We agree that the abstract would benefit from greater specificity to support the central claim. The full manuscript reports quantitative results in the experiments section, including mIoU improvements of 3-8% on downstream MR semantic segmentation tasks across two SSL frameworks and multiple datasets (with sizes such as approximately 50,000 MR images and 5,000 HR images used in pretraining). Ablation studies isolate the affinity component's contribution, and standard error bars are included. We will revise the abstract to incorporate key metrics, dataset details, and a concise reference to the ablation findings. revision: yes
Referee: Experiments (implied by abstract description): comparisons are limited to HR-only and MR-only pretraining baselines. This does not isolate the contribution of the spatial affinity component from the simple effect of exposing the SSL framework to a larger combined HR+MR data volume; controls such as joint batch training, multi-scale augmentations, or sequential HR-then-MR pretraining are required to substantiate the mechanism.

Authors: We appreciate this observation on isolating the mechanism. The spatial affinity component is designed such that HR data informs MR representations via affinity maps rather than simply increasing overall data volume; the number of MR samples is held constant across all settings, and we include a combined HR+MR pretraining baseline without the affinity module to control for exposure effects. That said, we acknowledge that explicit controls like joint batch training or sequential pretraining would further strengthen the claims. We will add a dedicated discussion subsection addressing data-volume confounds and include results from at least one additional control (e.g., sequential HR-then-MR pretraining) in the revised version. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an additive spatial affinity component to existing SSL frameworks that incorporates HR imagery to improve MR representations, with empirical validation showing gains over HR-only and MR-only pretraining baselines on two frameworks. No load-bearing derivations, equations, or self-citations are described that reduce the claimed improvement to a definitional equivalence, fitted parameter renamed as prediction, or self-referential uniqueness theorem. The method is presented as an independent module whose value is assessed through direct comparison to scale-specific baselines, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that a newly designed spatial affinity component can be compatibly added to standard self-supervised frameworks and will produce measurable gains when high-resolution data is available during pretraining.

axioms (1)

domain assumption Existing self-supervised learning frameworks can be extended with a spatial affinity component that uses cross-scale imagery to improve mid-resolution representations
The paper assumes compatibility and benefit without proving the extension preserves original framework properties.

invented entities (1)

spatial affinity component no independent evidence
purpose: To use HR imagery to learn better representations of MR imagery within self-supervised pretraining
Newly introduced module whose effectiveness is asserted via experiments on two frameworks

pith-pipeline@v0.9.0 · 5401 in / 1293 out tokens · 33385 ms · 2026-05-16T13:13:04.670723+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 3 internal anchors

[1]

Ssl4eo-s12: A large-scale multimodal, multi- temporal dataset for self-supervised learning in earth observa- tion [software and data sets],

Y . Wang, N. A. A. Braham, Z. Xiong, C. Liu, C. M. Albrecht, and X. X. Zhu, “Ssl4eo-s12: A large-scale multimodal, multi- temporal dataset for self-supervised learning in earth observa- tion [software and data sets],”IEEE Geoscience and Remote Sensing Magazine, vol. 11, no. 3, pp. 98–106, 2023

work page 2023
[2]

Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,

Y . Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y . He, M. Burke, D. Lobell, and S. Ermon, “Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,”Advances in Neural Information Processing Systems, vol. 35, pp. 197–211, 2022

work page 2022
[3]

The Effects of Super- Resolution on Object Detection Performance in Satellite Imageryimagery,

J. Shermeyer and A. Van Etten, “The Effects of Super- Resolution on Object Detection Performance in Satellite Imageryimagery,”IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 2019-June, pp. 1432–1441, 6 2019. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9025375

work page arXiv 2019
[4]

Scale- mae: A scale-aware masked autoencoder for multiscale geospa- tial representation learning,

C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale- mae: A scale-aware masked autoencoder for multiscale geospa- tial representation learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4088– 4099

work page 2023
[5]

Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning,

V . Nedungadi, A. Kariryaa, S. Oehmcke, S. Belongie, C. Igel, and N. Lang, “Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 164–182

work page 2024
[6]

Prithvi-eo-2.0: A versatile multi- temporal foundation model for earth observation applications,

D. Szwarcman, S. Roy, P. Fraccaro, O. E. G ´ıslason, B. Blu- menstiel, R. Ghosal, P. H. De Oliveira, J. L. de Sousa Almeida, R. Sedona, Y . Kanget al., “Prithvi-eo-2.0: A versatile multi- temporal foundation model for earth observation applications,” IEEE Transactions on Geoscience and Remote Sensing, 2025

work page 2025
[7]

Galileo: Learning global & local features of many remote sensing modalities,

G. Tseng, A. Fuller, M. Reil, H. Herzog, P. Beukema, F. Bastani, J. R. Green, E. Shelhamer, H. Kerner, and D. Rolnick, “Galileo: Learning global & local features of many remote sensing modalities,”arXiv preprint arXiv:2502.09356, 2025

work page arXiv 2025
[8]

AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

C. F. Brown, M. R. Kazmierski, V . J. Pasquarella, W. J. Rucklidge, M. Samsikova, C. Zhang, E. Shelhamer, E. Lahera, O. Wiles, S. Ilyushchenkoet al., “Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data,”arXiv preprint arXiv:2507.22291, 2025

work page internal anchor Pith review arXiv 2025
[9]

TerraMind: Large-scale generative multimodality for Earth observation,

J. Jakubik, F. Yang, B. Blumenstiel, E. Scheurer, R. Sedona, S. Maurogiovanni, J. Bosmans, N. Dionelis, V . Marsocci, N. Koppet al., “Terramind: Large-scale generative multimodal- ity for earth observation,”arXiv preprint arXiv:2504.11171, 2025

work page arXiv 2025
[10]

Sen2venµs, a dataset for the training of sentinel-2 super- resolution algorithms,

J. Michel, J. Vinasco-Salinas, J. Inglada, and O. Hagolle, “Sen2venµs, a dataset for the training of sentinel-2 super- resolution algorithms,”Data, vol. 7, no. 7, 2022. [Online]. Available: https://www.mdpi.com/2306-5729/7/7/96

work page 2022
[11]

DINOv3

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa et al., “Dinov3,”arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[12]

Hyperspectral Imagery Classification Based on Contrastive Learning,

S. Hou, H. Shi, X. Cao, X. Zhang, and L. Jiao, “Hyperspectral Imagery Classification Based on Contrastive Learning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60,

work page
[13]

Available: https://ieeexplore.ieee.org/document/ 9664575

[Online]. Available: https://ieeexplore.ieee.org/document/ 9664575

work page
[14]

Geography-Aware Self-Supervised Learning,

K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, and S. Ermon, “Geography-Aware Self-Supervised Learning,”Proceedings of the IEEE International Conference on Computer Vision, pp. 10 161–10 170, 2021. [Online]. Avail- able: https://ieeexplore.ieee.org/abstract/document/9711401

work page arXiv 2021
[15]

Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data,

N. Jean, S. Wang, A. Samar, G. Azzari, D. Lobell, and S. Ermon, “Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 3967–3974, 7 2019. [Online]. Available: https://ojs.aaai.org/ index.php/AAAI/article/view/4288

work page 2019
[16]

Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining,

Y . Wang, C. M. Albrecht, and X. X. Zhu, “Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, 2024. [Online]. Available: https://ieeexplore. ieee.org/abstract/document/10726860

work page arXiv 2024
[17]

A General Self-Supervised Framework for Remote Sensing Image Classification,

Y . Gao, X. Sun, and C. Liu, “A General Self-Supervised Framework for Remote Sensing Image Classification,”Remote Sensing 2022, Vol. 14, Page 4824, vol. 14, no. 19, p. 4824, 9

work page 2022
[18]

Available: https://www.mdpi.com/2072-4292/ 14/19/4824/htmhttps://www.mdpi.com/2072-4292/14/19/4824

[Online]. Available: https://www.mdpi.com/2072-4292/ 14/19/4824/htmhttps://www.mdpi.com/2072-4292/14/19/4824

work page 2072
[19]

Cross-Scale MAE: A Tale of Multiscale Exploitation in Re- mote Sensing,

M. Tang, A. Cozma, K. Georgiou, H. Qi, and M. H. Kao, “Cross-Scale MAE: A Tale of Multiscale Exploitation in Re- mote Sensing,”Advances in Neural Information Processing Systems, vol. 36, pp. 20 054–20 066, 12 2023

work page 2023
[20]

Masked autoencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

work page 2022
[21]

Towards latent masked image modeling for self-supervised visual representation learn- ing,

Y . Wei, A. Gupta, and P. Morgado, “Towards latent masked image modeling for self-supervised visual representation learn- ing,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 1–17

work page 2024
[22]

Functional Map of the World,

G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional Map of the World,”Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6172–6180, 12 2018. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8578744

work page arXiv 2018
[23]

Camera lens super-resolution,

C. Chen, Z. Xiong, X. Tian, Z. J. Zha, and F. Wu, “Camera lens super-resolution,”Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2019-June, pp. 1652–1660, 6 2019. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8954317

work page arXiv 2019
[24]

Toward real-world single image super-resolution: A new benchmark and a new model,

J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang, “Toward real-world single image super-resolution: A new benchmark and a new model,”Proceedings of the IEEE International Conference on Computer Vision, pp. 3086– 3095, 10 2019. [Online]. Available: https://ieeexplore.ieee.org/ abstract/document/9009805

work page arXiv 2019
[25]

Perceptual generative adversarial networks for small object detection,

J. Li, X. Liang, Y . Wei, T. Xu, J. Feng, and S. Yan, “Perceptual generative adversarial networks for small object detection,”Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017- January, pp. 1951–1959, 11 2017. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8099694

work page arXiv 2017
[26]

Target-Guided Feature Super-Resolution for Vehicle Detection in Remote Sensing Images,

J. Li, Z. Zhang, Y . Tian, Y . Xu, Y . Wen, and S. Wang, “Target-Guided Feature Super-Resolution for Vehicle Detection in Remote Sensing Images,”IEEE Geoscience and Remote Sensing Letters, vol. 19, 2021. [Online]. Available: https: //ieeexplore.ieee.org/document/9548683

work page arXiv 2021
[27]

Self-supervised learning from images with a joint-embedding predictive architecture,

M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 619–15 629

work page 2023
[28]

Emerging properties in self- supervised vision transformers,

M. Caron, H. Touvron, I. Misra, H. J’egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self- supervised vision transformers,”2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9630–9640, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID: 233444273

work page 2021
[29]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

A. Dosovitskiy, A. Kolesnikov, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer, M. Dehghani, N. Houlsby, S. Gelly, T. Unterthiner, and X. Zhai, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

work page internal anchor Pith review Pith/arXiv arXiv 2010
[30]

GEO-Bench: Toward Foundation Models for Earth Monitoring,

A. Lacoste, N. Lehmann, P. Rodriguez, E. D. Sherwin, H. Kerner, B. L ¨utjens, J. Irvin, D. Dao, H. Alemohammad, A. Drouin, M. Gunturkun, G. Huang, D. Vazquez, D. Newman, Y . Bengio, S. Ermon, and X. X. Zhu, “GEO-Bench: Toward Foundation Models for Earth Monitoring,”Advances in Neural Information Processing Systems, vol. 36, pp. 51 080–51 093, 12 2023. [On...

work page 2023