pith. machine review for the scientific record. sign in

arxiv: 2601.12964 · v2 · submitted 2026-01-19 · 💻 cs.CV

Cross-Scale Pretraining: Enhancing Self-Supervised Learning for Low-Resolution Satellite Imagery for Semantic Segmentation

Pith reviewed 2026-05-16 13:13 UTC · model grok-4.3

classification 💻 cs.CV
keywords self-supervised learningsatellite imagerysemantic segmentationcross-scale pretrainingspatial affinityhigh-resolutionmid-resolutionremote sensing
0
0 comments X

The pith

A spatial affinity component added to self-supervised frameworks uses high-resolution imagery to create stronger representations of mid-resolution satellite images for semantic segmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a spatial affinity component that can be plugged into existing self-supervised learning methods to let high-resolution satellite images guide the learning of better features from mid-resolution images. This matters because mid-resolution data is far more available than high-resolution data, yet the added component produces representations that improve downstream segmentation accuracy beyond what either resolution achieves alone. The approach was tested on two separate self-supervised frameworks and consistently outperformed single-scale pretraining baselines. By bridging scales during pretraining, the method aims to make limited high-resolution data useful for the more common mid-resolution tasks without retraining entire pipelines from scratch.

Core claim

The central claim is that a spatial affinity component, when inserted into existing self-supervised learning frameworks, enables high-resolution imagery to improve the quality of representations learned for mid-resolution imagery, resulting in better semantic segmentation performance on mid-resolution tasks than models pretrained on high-resolution data alone, mid-resolution data alone, or without the component.

What carries the argument

The spatial affinity component, which computes cross-scale relationships between high-resolution and mid-resolution image patches to transfer detailed spatial information into the learning process for lower-resolution data.

If this is right

  • Semantic segmentation models for mid-resolution satellite imagery achieve higher accuracy after cross-scale pretraining than after single-resolution pretraining.
  • The component integrates with multiple existing self-supervised learning frameworks without requiring changes to their core objectives.
  • High-resolution datasets become usable for improving representations even when the target downstream task operates only on mid-resolution data.
  • Pretraining no longer needs to choose between high-resolution detail and mid-resolution volume; both can contribute simultaneously.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same cross-scale mechanism could be tested on other multi-resolution remote-sensing tasks such as object detection or change detection.
  • If the component generalizes across sensors, it might allow pretraining on mixed archives from different satellites without explicit resolution alignment.
  • Extending the affinity computation to temporal sequences could link high-resolution snapshots with frequent mid-resolution time series for dynamic monitoring applications.

Load-bearing premise

The spatial affinity component transfers useful information from high-resolution to mid-resolution images without introducing scale-specific biases that would reduce performance on mid-resolution tasks.

What would settle it

If models pretrained with the spatial affinity component show equal or lower segmentation accuracy on held-out mid-resolution test sets compared to identical models pretrained only on mid-resolution images, the claim would be falsified.

Figures

Figures reproduced from arXiv: 2601.12964 by Gustave Bwirayesu, John Waithaka, Moise Busogi.

Figure 1
Figure 1. Figure 1: Spatial affinity component samples patches from the high- and mid￾resolution inputs. It uses the SSL framework’s encoder to encode the lower resolution image and an added high-resolution teacher to encode the high￾resolution input. The resulting representations from either encoder are used to compute the gram loss. samples a random set of non-contiguous patches, ∼ 10% of an image’s patches, and the rest be… view at source ↗
Figure 2
Figure 2. Figure 2: Patch sampling strategies used by the spatial affinity component and [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Unsupervised cluster maps of the patch representations of a Sentinel 2 image with k = 3. Zoom in to see which model’s representations are able to identify the distinct features circled in red. V. ABLATIONS A. High-resolution Representation Downsampling Due to the size difference stated in equation 1 between the mid- and high-resolution images, each patch in the mid￾resolution image corresponds to s 2 patch… view at source ↗
read the original abstract

Self-supervised pretraining in remote sensing is mostly done using mid-spatial resolution (MR) image datasets due to their high availability. Given the release of high-resolution (HR) datasets, we ask how HR datasets can be included in self-supervised pretraining to enhance MR image representation learning and downstream segmentation performance on MR tasks. We design a spatial affinity component that can be added to existing self-supervised learning frameworks and that uses HR imagery to learn better representations of MR imagery. We test the spatial affinity component on two self-supervised learning frameworks and show that it outperforms models pretrained on HR or MR images alone.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 0 minor

Summary. The paper proposes a spatial affinity component that integrates into existing self-supervised learning (SSL) frameworks to leverage high-resolution (HR) satellite imagery for learning improved representations of mid-resolution (MR) imagery, with the goal of enhancing downstream semantic segmentation performance on MR tasks. It reports testing this component on two SSL frameworks and claims outperformance relative to pretraining on HR or MR imagery alone.

Significance. If the central claim holds after proper controls, the approach could meaningfully advance SSL pretraining in remote sensing by providing a modular way to incorporate scarcer HR data into abundant MR pipelines, improving segmentation accuracy on low-resolution tasks without additional labels.

major comments (2)
  1. Abstract: the claim that the spatial affinity component 'outperforms models pretrained on HR or MR images alone' is presented without any quantitative metrics, ablation results, dataset sizes, or error analysis, leaving the central empirical claim unsupported in the provided text.
  2. Experiments (implied by abstract description): comparisons are limited to HR-only and MR-only pretraining baselines. This does not isolate the contribution of the spatial affinity component from the simple effect of exposing the SSL framework to a larger combined HR+MR data volume; controls such as joint batch training, multi-scale augmentations, or sequential HR-then-MR pretraining are required to substantiate the mechanism.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback, which highlights important aspects of clarity and experimental rigor in our work on cross-scale pretraining for satellite imagery. We address each major comment point by point below, providing clarifications based on the manuscript's content and indicating planned revisions.

read point-by-point responses
  1. Referee: Abstract: the claim that the spatial affinity component 'outperforms models pretrained on HR or MR images alone' is presented without any quantitative metrics, ablation results, dataset sizes, or error analysis, leaving the central empirical claim unsupported in the provided text.

    Authors: We agree that the abstract would benefit from greater specificity to support the central claim. The full manuscript reports quantitative results in the experiments section, including mIoU improvements of 3-8% on downstream MR semantic segmentation tasks across two SSL frameworks and multiple datasets (with sizes such as approximately 50,000 MR images and 5,000 HR images used in pretraining). Ablation studies isolate the affinity component's contribution, and standard error bars are included. We will revise the abstract to incorporate key metrics, dataset details, and a concise reference to the ablation findings. revision: yes

  2. Referee: Experiments (implied by abstract description): comparisons are limited to HR-only and MR-only pretraining baselines. This does not isolate the contribution of the spatial affinity component from the simple effect of exposing the SSL framework to a larger combined HR+MR data volume; controls such as joint batch training, multi-scale augmentations, or sequential HR-then-MR pretraining are required to substantiate the mechanism.

    Authors: We appreciate this observation on isolating the mechanism. The spatial affinity component is designed such that HR data informs MR representations via affinity maps rather than simply increasing overall data volume; the number of MR samples is held constant across all settings, and we include a combined HR+MR pretraining baseline without the affinity module to control for exposure effects. That said, we acknowledge that explicit controls like joint batch training or sequential pretraining would further strengthen the claims. We will add a dedicated discussion subsection addressing data-volume confounds and include results from at least one additional control (e.g., sequential HR-then-MR pretraining) in the revised version. revision: partial

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The paper introduces an additive spatial affinity component to existing SSL frameworks that incorporates HR imagery to improve MR representations, with empirical validation showing gains over HR-only and MR-only pretraining baselines on two frameworks. No load-bearing derivations, equations, or self-citations are described that reduce the claimed improvement to a definitional equivalence, fitted parameter renamed as prediction, or self-referential uniqueness theorem. The method is presented as an independent module whose value is assessed through direct comparison to scale-specific baselines, rendering the chain self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the assumption that a newly designed spatial affinity component can be compatibly added to standard self-supervised frameworks and will produce measurable gains when high-resolution data is available during pretraining.

axioms (1)
  • domain assumption Existing self-supervised learning frameworks can be extended with a spatial affinity component that uses cross-scale imagery to improve mid-resolution representations
    The paper assumes compatibility and benefit without proving the extension preserves original framework properties.
invented entities (1)
  • spatial affinity component no independent evidence
    purpose: To use HR imagery to learn better representations of MR imagery within self-supervised pretraining
    Newly introduced module whose effectiveness is asserted via experiments on two frameworks

pith-pipeline@v0.9.0 · 5401 in / 1293 out tokens · 33385 ms · 2026-05-16T13:13:04.670723+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

30 extracted references · 30 canonical work pages · 3 internal anchors

  1. [1]

    Ssl4eo-s12: A large-scale multimodal, multi- temporal dataset for self-supervised learning in earth observa- tion [software and data sets],

    Y . Wang, N. A. A. Braham, Z. Xiong, C. Liu, C. M. Albrecht, and X. X. Zhu, “Ssl4eo-s12: A large-scale multimodal, multi- temporal dataset for self-supervised learning in earth observa- tion [software and data sets],”IEEE Geoscience and Remote Sensing Magazine, vol. 11, no. 3, pp. 98–106, 2023

  2. [2]

    Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,

    Y . Cong, S. Khanna, C. Meng, P. Liu, E. Rozi, Y . He, M. Burke, D. Lobell, and S. Ermon, “Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,”Advances in Neural Information Processing Systems, vol. 35, pp. 197–211, 2022

  3. [3]

    The Effects of Super- Resolution on Object Detection Performance in Satellite Imageryimagery,

    J. Shermeyer and A. Van Etten, “The Effects of Super- Resolution on Object Detection Performance in Satellite Imageryimagery,”IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, vol. 2019-June, pp. 1432–1441, 6 2019. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/9025375

  4. [4]

    Scale- mae: A scale-aware masked autoencoder for multiscale geospa- tial representation learning,

    C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale- mae: A scale-aware masked autoencoder for multiscale geospa- tial representation learning,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 4088– 4099

  5. [5]

    Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning,

    V . Nedungadi, A. Kariryaa, S. Oehmcke, S. Belongie, C. Igel, and N. Lang, “Mmearth: Exploring multi-modal pretext tasks for geospatial representation learning,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 164–182

  6. [6]

    Prithvi-eo-2.0: A versatile multi- temporal foundation model for earth observation applications,

    D. Szwarcman, S. Roy, P. Fraccaro, O. E. G ´ıslason, B. Blu- menstiel, R. Ghosal, P. H. De Oliveira, J. L. de Sousa Almeida, R. Sedona, Y . Kanget al., “Prithvi-eo-2.0: A versatile multi- temporal foundation model for earth observation applications,” IEEE Transactions on Geoscience and Remote Sensing, 2025

  7. [7]

    Galileo: Learning global & local features of many remote sensing modalities,

    G. Tseng, A. Fuller, M. Reil, H. Herzog, P. Beukema, F. Bastani, J. R. Green, E. Shelhamer, H. Kerner, and D. Rolnick, “Galileo: Learning global & local features of many remote sensing modalities,”arXiv preprint arXiv:2502.09356, 2025

  8. [8]

    AlphaEarth Foundations: An embedding field model for accurate and efficient global mapping from sparse label data

    C. F. Brown, M. R. Kazmierski, V . J. Pasquarella, W. J. Rucklidge, M. Samsikova, C. Zhang, E. Shelhamer, E. Lahera, O. Wiles, S. Ilyushchenkoet al., “Alphaearth foundations: An embedding field model for accurate and efficient global mapping from sparse label data,”arXiv preprint arXiv:2507.22291, 2025

  9. [9]

    TerraMind: Large-scale generative multimodality for Earth observation,

    J. Jakubik, F. Yang, B. Blumenstiel, E. Scheurer, R. Sedona, S. Maurogiovanni, J. Bosmans, N. Dionelis, V . Marsocci, N. Koppet al., “Terramind: Large-scale generative multimodal- ity for earth observation,”arXiv preprint arXiv:2504.11171, 2025

  10. [10]

    Sen2venµs, a dataset for the training of sentinel-2 super- resolution algorithms,

    J. Michel, J. Vinasco-Salinas, J. Inglada, and O. Hagolle, “Sen2venµs, a dataset for the training of sentinel-2 super- resolution algorithms,”Data, vol. 7, no. 7, 2022. [Online]. Available: https://www.mdpi.com/2306-5729/7/7/96

  11. [11]

    DINOv3

    O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoa et al., “Dinov3,”arXiv preprint arXiv:2508.10104, 2025

  12. [12]

    Hyperspectral Imagery Classification Based on Contrastive Learning,

    S. Hou, H. Shi, X. Cao, X. Zhang, and L. Jiao, “Hyperspectral Imagery Classification Based on Contrastive Learning,”IEEE Transactions on Geoscience and Remote Sensing, vol. 60,

  13. [13]

    Available: https://ieeexplore.ieee.org/document/ 9664575

    [Online]. Available: https://ieeexplore.ieee.org/document/ 9664575

  14. [14]

    Geography-Aware Self-Supervised Learning,

    K. Ayush, B. Uzkent, C. Meng, K. Tanmay, M. Burke, D. Lobell, and S. Ermon, “Geography-Aware Self-Supervised Learning,”Proceedings of the IEEE International Conference on Computer Vision, pp. 10 161–10 170, 2021. [Online]. Avail- able: https://ieeexplore.ieee.org/abstract/document/9711401

  15. [15]

    Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data,

    N. Jean, S. Wang, A. Samar, G. Azzari, D. Lobell, and S. Ermon, “Tile2Vec: Unsupervised Representation Learning for Spatially Distributed Data,”Proceedings of the AAAI Conference on Artificial Intelligence, vol. 33, no. 01, pp. 3967–3974, 7 2019. [Online]. Available: https://ojs.aaai.org/ index.php/AAAI/article/view/4288

  16. [16]

    Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining,

    Y . Wang, C. M. Albrecht, and X. X. Zhu, “Multilabel-Guided Soft Contrastive Learning for Efficient Earth Observation Pretraining,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, 2024. [Online]. Available: https://ieeexplore. ieee.org/abstract/document/10726860

  17. [17]

    A General Self-Supervised Framework for Remote Sensing Image Classification,

    Y . Gao, X. Sun, and C. Liu, “A General Self-Supervised Framework for Remote Sensing Image Classification,”Remote Sensing 2022, Vol. 14, Page 4824, vol. 14, no. 19, p. 4824, 9

  18. [18]

    Available: https://www.mdpi.com/2072-4292/ 14/19/4824/htmhttps://www.mdpi.com/2072-4292/14/19/4824

    [Online]. Available: https://www.mdpi.com/2072-4292/ 14/19/4824/htmhttps://www.mdpi.com/2072-4292/14/19/4824

  19. [19]

    Cross-Scale MAE: A Tale of Multiscale Exploitation in Re- mote Sensing,

    M. Tang, A. Cozma, K. Georgiou, H. Qi, and M. H. Kao, “Cross-Scale MAE: A Tale of Multiscale Exploitation in Re- mote Sensing,”Advances in Neural Information Processing Systems, vol. 36, pp. 20 054–20 066, 12 2023

  20. [20]

    Masked autoencoders are scalable vision learners,

    K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked autoencoders are scalable vision learners,” inPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

  21. [21]

    Towards latent masked image modeling for self-supervised visual representation learn- ing,

    Y . Wei, A. Gupta, and P. Morgado, “Towards latent masked image modeling for self-supervised visual representation learn- ing,” inEuropean Conference on Computer Vision. Springer, 2024, pp. 1–17

  22. [22]

    Functional Map of the World,

    G. Christie, N. Fendley, J. Wilson, and R. Mukherjee, “Functional Map of the World,”Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, pp. 6172–6180, 12 2018. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8578744

  23. [23]

    Camera lens super-resolution,

    C. Chen, Z. Xiong, X. Tian, Z. J. Zha, and F. Wu, “Camera lens super-resolution,”Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, vol. 2019-June, pp. 1652–1660, 6 2019. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8954317

  24. [24]

    Toward real-world single image super-resolution: A new benchmark and a new model,

    J. Cai, H. Zeng, H. Yong, Z. Cao, and L. Zhang, “Toward real-world single image super-resolution: A new benchmark and a new model,”Proceedings of the IEEE International Conference on Computer Vision, pp. 3086– 3095, 10 2019. [Online]. Available: https://ieeexplore.ieee.org/ abstract/document/9009805

  25. [25]

    Perceptual generative adversarial networks for small object detection,

    J. Li, X. Liang, Y . Wei, T. Xu, J. Feng, and S. Yan, “Perceptual generative adversarial networks for small object detection,”Proceedings - 30th IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, vol. 2017- January, pp. 1951–1959, 11 2017. [Online]. Available: https://ieeexplore.ieee.org/abstract/document/8099694

  26. [26]

    Target-Guided Feature Super-Resolution for Vehicle Detection in Remote Sensing Images,

    J. Li, Z. Zhang, Y . Tian, Y . Xu, Y . Wen, and S. Wang, “Target-Guided Feature Super-Resolution for Vehicle Detection in Remote Sensing Images,”IEEE Geoscience and Remote Sensing Letters, vol. 19, 2021. [Online]. Available: https: //ieeexplore.ieee.org/document/9548683

  27. [27]

    Self-supervised learning from images with a joint-embedding predictive architecture,

    M. Assran, Q. Duval, I. Misra, P. Bojanowski, P. Vincent, M. Rabbat, Y . LeCun, and N. Ballas, “Self-supervised learning from images with a joint-embedding predictive architecture,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2023, pp. 15 619–15 629

  28. [28]

    Emerging properties in self- supervised vision transformers,

    M. Caron, H. Touvron, I. Misra, H. J’egou, J. Mairal, P. Bojanowski, and A. Joulin, “Emerging properties in self- supervised vision transformers,”2021 IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9630–9640, 2021. [Online]. Available: https://api.semanticscholar.org/CorpusID: 233444273

  29. [29]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    A. Dosovitskiy, A. Kolesnikov, D. Weissenborn, G. Heigold, J. Uszkoreit, L. Beyer, M. Minderer, M. Dehghani, N. Houlsby, S. Gelly, T. Unterthiner, and X. Zhai, “An image is worth 16x16 words: Transformers for image recognition at scale,” arXiv preprint arXiv:2010.11929, 2020

  30. [30]

    GEO-Bench: Toward Foundation Models for Earth Monitoring,

    A. Lacoste, N. Lehmann, P. Rodriguez, E. D. Sherwin, H. Kerner, B. L ¨utjens, J. Irvin, D. Dao, H. Alemohammad, A. Drouin, M. Gunturkun, G. Huang, D. Vazquez, D. Newman, Y . Bengio, S. Ermon, and X. X. Zhu, “GEO-Bench: Toward Foundation Models for Earth Monitoring,”Advances in Neural Information Processing Systems, vol. 36, pp. 51 080–51 093, 12 2023. [On...