Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation
Pith reviewed 2026-05-16 13:13 UTC · model grok-4.3
The pith
A spatial affinity component added to self-supervised frameworks uses high-resolution imagery to create stronger representations of mid-resolution satellite images for semantic segmentation.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a spatial affinity component, when inserted into existing self-supervised learning frameworks, enables high-resolution imagery to improve the quality of representations learned for mid-resolution imagery, resulting in better semantic segmentation performance on mid-resolution tasks than models pretrained on high-resolution data alone, mid-resolution data alone, or without the component.
What carries the argument
The spatial affinity component, which computes cross-scale relationships between high-resolution and mid-resolution image patches to transfer detailed spatial information into the learning process for lower-resolution data.
If this is right
- Semantic segmentation models for mid-resolution satellite imagery achieve higher accuracy after cross-scale pretraining than after single-resolution pretraining.
- The component integrates with multiple existing self-supervised learning frameworks without requiring changes to their core objectives.
- High-resolution datasets become usable for improving representations even when the target downstream task operates only on mid-resolution data.
- Pretraining no longer needs to choose between high-resolution detail and mid-resolution volume; both can contribute simultaneously.
Where Pith is reading between the lines
- The same cross-scale mechanism could be tested on other multi-resolution remote-sensing tasks such as object detection or change detection.
- If the component generalizes across sensors, it might allow pretraining on mixed archives from different satellites without explicit resolution alignment.
- Extending the affinity computation to temporal sequences could link high-resolution snapshots with frequent mid-resolution time series for dynamic monitoring applications.
Load-bearing premise
The spatial affinity component transfers useful information from high-resolution to mid-resolution images without introducing scale-specific biases that would reduce performance on mid-resolution tasks.
What would settle it
If models pretrained with the spatial affinity component show equal or lower segmentation accuracy on held-out mid-resolution test sets compared to identical models pretrained only on mid-resolution images, the claim would be falsified.
Figures
read the original abstract
Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a spatial affinity component that integrates into existing self-supervised learning (SSL) frameworks to leverage high-resolution (HR) satellite imagery for learning improved representations of mid-resolution (MR) imagery, with the goal of enhancing downstream semantic segmentation performance on MR tasks. It reports testing this component on two SSL frameworks and claims outperformance relative to pretraining on HR or MR imagery alone.
Significance. If the central claim holds after proper controls, the approach could meaningfully advance SSL pretraining in remote sensing by providing a modular way to incorporate scarcer HR data into abundant MR pipelines, improving segmentation accuracy on low-resolution tasks without additional labels.
major comments (2)
- Abstract: the claim that the spatial affinity component 'outperforms models pretrained on HR or MR images alone' is presented without any quantitative metrics, ablation results, dataset sizes, or error analysis, leaving the central empirical claim unsupported in the provided text.
- Experiments (implied by abstract description): comparisons are limited to HR-only and MR-only pretraining baselines. This does not isolate the contribution of the spatial affinity component from the simple effect of exposing the SSL framework to a larger combined HR+MR data volume; controls such as joint batch training, multi-scale augmentations, or sequential HR-then-MR pretraining are required to substantiate the mechanism.
Simulated Author's Rebuttal
We thank the referee for their constructive feedback, which highlights important aspects of clarity and experimental rigor in our work on cross-scale pretraining for satellite imagery. We address each major comment point by point below, providing clarifications based on the manuscript's content and indicating planned revisions.
read point-by-point responses
-
Referee: Abstract: the claim that the spatial affinity component 'outperforms models pretrained on HR or MR images alone' is presented without any quantitative metrics, ablation results, dataset sizes, or error analysis, leaving the central empirical claim unsupported in the provided text.
Authors: We agree that the abstract would benefit from greater specificity to support the central claim. The full manuscript reports quantitative results in the experiments section, including mIoU improvements of 3-8% on downstream MR semantic segmentation tasks across two SSL frameworks and multiple datasets (with sizes such as approximately 50,000 MR images and 5,000 HR images used in pretraining). Ablation studies isolate the affinity component's contribution, and standard error bars are included. We will revise the abstract to incorporate key metrics, dataset details, and a concise reference to the ablation findings. revision: yes
-
Referee: Experiments (implied by abstract description): comparisons are limited to HR-only and MR-only pretraining baselines. This does not isolate the contribution of the spatial affinity component from the simple effect of exposing the SSL framework to a larger combined HR+MR data volume; controls such as joint batch training, multi-scale augmentations, or sequential HR-then-MR pretraining are required to substantiate the mechanism.
Authors: We appreciate this observation on isolating the mechanism. The spatial affinity component is designed such that HR data informs MR representations via affinity maps rather than simply increasing overall data volume; the number of MR samples is held constant across all settings, and we include a combined HR+MR pretraining baseline without the affinity module to control for exposure effects. That said, we acknowledge that explicit controls like joint batch training or sequential pretraining would further strengthen the claims. We will add a dedicated discussion subsection addressing data-volume confounds and include results from at least one additional control (e.g., sequential HR-then-MR pretraining) in the revised version. revision: partial
Circularity Check
No significant circularity detected
full rationale
The paper introduces an additive spatial affinity component to existing SSL frameworks that incorporates HR imagery to improve MR representations, with empirical validation showing gains over HR-only and MR-only pretraining baselines on two frameworks. No load-bearing derivations, equations, or self-citations are described that reduce the claimed improvement to a definitional equivalence, fitted parameter renamed as prediction, or self-referential uniqueness theorem. The method is presented as an independent module whose value is assessed through direct comparison to scale-specific baselines, rendering the chain self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Existing self-supervised learning frameworks can be extended with a spatial affinity component that uses cross-scale imagery to improve mid-resolution representations
invented entities (1)
-
spatial affinity component
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Y . Wang, N. A. A. Braham, Z. Xiong, C. Liu, C. M. Albrecht, and X. X. Zhu, “Ssl4eo-s12: A large-scale multimodal, multi- temporal dataset for self-supervised learning in earth observa- tion [software and data sets],”IEEE Geoscience and Remote Sensing Magazine, vol. 11, no. 3, pp. 98–106, 2023
work page 2023
-
[2]
Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,
Y . Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y . He, M. Burke, D. Lobell, and S. Ermon, “Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,”Advances in Neural Information Processing Systems, vol. 35, pp. 197–211, 2022
work page 2022
-
[3]
The Effects of Super- Resolution on Object Detection Performance in Satellite Imageryimagery,
J. Shermeyer and A. Van Etten, “The Effects of Super- Resolution on Object Detection Performance in Satellite Imageryimagery,”IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 2019-June, pp. 1432–1441, 6 2019. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9025375
-
[4]
Scale- mae: A scale-aware masked autoencoder for multiscale geospa- tial representation learning,
C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale- mae: A scale-aware masked autoencoder for multiscale geospa- tial representation learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4088– 4099
work page 2023
-
[5]
Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning,
V . Nedungadi, A. Kariryaa, S. Oehmcke, S. Belongie, C. Igel, and N. Lang, “Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 164–182
work page 2024
-
[6]
Prithvi-eo-2.0: A versatile multi- temporal foundation model for earth observation applications,
D. Szwarcman, S. Roy, P. Fraccaro, O. E. G ´ıslason, B. Blu- menstiel, R. Ghosal, P. H. De Oliveira, J. L. de Sousa Almeida, R. Sedona, Y . Kanget al., “Prithvi-eo-2.0: A versatile multi- temporal foundation model for earth observation applications,” IEEE Transactions on Geoscience and Remote Sensing, 2025
work page 2025
-
[7]
Galileo: Learning global & local features of many remote sensing modalities,
G. Tseng, A. Fuller, M. Reil, H. Herzog, P. Beukema, F. Bastani, J. R. Green, E. Shelhamer, H. Kerner, and D. Rolnick, “Galileo: Learning global & local features of many remote sensing modalities,”arXiv preprint arXiv:2502.09356, 2025
-
[8]
C. F. Brown, M. R. Kazmierski, V . J. Pasquarella, W. J. Rucklidge, M. Samsikova, C. Zhang, E. Shelhamer, E. Lahera, O. Wiles, S. Ilyushchenkoet al., “Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data,”arXiv preprint arXiv:2507.22291, 2025
work page internal anchor Pith review arXiv 2025
-
[9]
TerraMind: Large-scale generative multimodality for Earth observation,
J. Jakubik, F. Yang, B. Blumenstiel, E. Scheurer, R. Sedona, S. Maurogiovanni, J. Bosmans, N. Dionelis, V . Marsocci, N. Koppet al., “Terramind: Large-scale generative multimodal- ity for earth observation,”arXiv preprint arXiv:2504.11171, 2025
-
[10]
Sen2venµs, a dataset for the training of sentinel-2 super- resolution algorithms,
J. Michel, J. Vinasco-Salinas, J. Inglada, and O. Hagolle, “Sen2venµs, a dataset for the training of sentinel-2 super- resolution algorithms,”Data, vol. 7, no. 7, 2022. [Online]. Available: https://www.mdpi.com/2306-5729/7/7/96
work page 2022
-
[11]
O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa et al., “Dinov3,”arXiv preprint arXiv:2508.10104, 2025
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[12]
Hyperspectral Imagery Classification Based on Contrastive Learning,
S. Hou, H. Shi, X. Cao, X. Zhang, and L. Jiao, “Hyperspectral Imagery Classification Based on Contrastive Learning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60,
-
[13]
Available: https://ieeexplore.ieee.org/document/ 9664575
[Online]. Available: https://ieeexplore.ieee.org/document/ 9664575
-
[14]
Geography-Aware Self-Supervised Learning,
K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, and S. Ermon, “Geography-Aware Self-Supervised Learning,”Proceedings of the IEEE International Conference on Computer Vision, pp. 10 161–10 170, 2021. [Online]. Avail- able: https://ieeexplore.ieee.org/abstract/document/9711401
-
[15]
Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data,
N. Jean, S. Wang, A. Samar, G. Azzari, D. Lobell, and S. Ermon, “Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 3967–3974, 7 2019. [Online]. Available: https://ojs.aaai.org/ index.php/AAAI/article/view/4288
work page 2019
-
[16]
Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining,
Y . Wang, C. M. Albrecht, and X. X. Zhu, “Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, 2024. [Online]. Available: https://ieeexplore. ieee.org/abstract/document/10726860
-
[17]
A General Self-Supervised Framework for Remote Sensing Image Classification,
Y . Gao, X. Sun, and C. Liu, “A General Self-Supervised Framework for Remote Sensing Image Classification,”Remote Sensing 2022, Vol. 14, Page 4824, vol. 14, no. 19, p. 4824, 9
work page 2022
-
[18]
Available: https://www.mdpi.com/2072-4292/ 14/19/4824/htmhttps://www.mdpi.com/2072-4292/14/19/4824
[Online]. Available: https://www.mdpi.com/2072-4292/ 14/19/4824/htmhttps://www.mdpi.com/2072-4292/14/19/4824
work page 2072
-
[19]
Cross-Scale MAE: A Tale of Multiscale Exploitation in Re- mote Sensing,
M. Tang, A. Cozma, K. Georgiou, H. Qi, and M. H. Kao, “Cross-Scale MAE: A Tale of Multiscale Exploitation in Re- mote Sensing,”Advances in Neural Information Processing Systems, vol. 36, pp. 20 054–20 066, 12 2023
work page 2023
-
[20]
Masked autoencoders are scalable vision learners,
K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009
work page 2022
-
[21]
Towards latent masked image modeling for self-supervised visual representation learn- ing,
Y . Wei, A. Gupta, and P. Morgado, “Towards latent masked image modeling for self-supervised visual representation learn- ing,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 1–17
work page 2024
-
[22]
G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional Map of the World,”Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6172–6180, 12 2018. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8578744
-
[23]
C. Chen, Z. Xiong, X. Tian, Z. J. Zha, and F. Wu, “Camera lens super-resolution,”Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2019-June, pp. 1652–1660, 6 2019. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8954317
-
[24]
Toward real-world single image super-resolution: A new benchmark and a new model,
J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang, “Toward real-world single image super-resolution: A new benchmark and a new model,”Proceedings of the IEEE International Conference on Computer Vision, pp. 3086– 3095, 10 2019. [Online]. Available: https://ieeexplore.ieee.org/ abstract/document/9009805
-
[25]
Perceptual generative adversarial networks for small object detection,
J. Li, X. Liang, Y . Wei, T. Xu, J. Feng, and S. Yan, “Perceptual generative adversarial networks for small object detection,”Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017- January, pp. 1951–1959, 11 2017. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8099694
-
[26]
Target-Guided Feature Super-Resolution for Vehicle Detection in Remote Sensing Images,
J. Li, Z. Zhang, Y . Tian, Y . Xu, Y . Wen, and S. Wang, “Target-Guided Feature Super-Resolution for Vehicle Detection in Remote Sensing Images,”IEEE Geoscience and Remote Sensing Letters, vol. 19, 2021. [Online]. Available: https: //ieeexplore.ieee.org/document/9548683
-
[27]
Self-supervised learning from images with a joint-embedding predictive architecture,
M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 619–15 629
work page 2023
-
[28]
Emerging properties in self- supervised vision transformers,
M. Caron, H. Touvron, I. Misra, H. J’egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self- supervised vision transformers,”2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9630–9640, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID: 233444273
work page 2021
-
[29]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
A. Dosovitskiy, A. Kolesnikov, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer, M. Dehghani, N. Houlsby, S. Gelly, T. Unterthiner, and X. Zhai, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[30]
GEO-Bench: Toward Foundation Models for Earth Monitoring,
A. Lacoste, N. Lehmann, P. Rodriguez, E. D. Sherwin, H. Kerner, B. L ¨utjens, J. Irvin, D. Dao, H. Alemohammad, A. Drouin, M. Gunturkun, G. Huang, D. Vazquez, D. Newman, Y . Bengio, S. Ermon, and X. X. Zhu, “GEO-Bench: Toward Foundation Models for Earth Monitoring,”Advances in Neural Information Processing Systems, vol. 36, pp. 51 080–51 093, 12 2023. [On...
work page 2023
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.