Developing a foundation model for high-resolution remote sensing data of the Netherlands

Heysem Kaya; Paul Vermeeren

arXiv: 2605.10184 · v1 · submitted 2026-05-11 · cs.CV · cs.AI

Pith reviewed 2026-05-12 04:08 UTC · model grok-4.3

keywords: foundation model, remote sensing, satellite imagery, temporal data, Vision Transformer, CNN, Netherlands, generalization

The pith

A foundation model trained solely on Dutch high-resolution satellite images achieves competitive results on global benchmarks with a smaller model and less data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper develops a foundation model pretrained on 1.2 million high-resolution satellite images limited to the Netherlands. It uses a hybrid architecture that pairs convolutional networks with vision transformers and adds temporal sequences as input to capture fine textures alongside large terrain patterns and time-dependent changes. The model is tested on local vegetation monitoring and multiple global benchmarking datasets. It records measurable gains from the temporal inputs on the Dutch task and matches larger state-of-the-art models on the international ones despite its reduced size and geographically narrow pretraining. A sympathetic reader would see this as evidence that rich, reusable representations can be obtained without global-scale data or compute.

Core claim

The authors show that a hybrid CNN-ViT model trained on 1.2 million high-resolution satellite images of the Netherlands, when supplied with temporal sequences, learns representations in which topographic, land-cover, and seasonal constraints reduce feature ambiguity. On the Netherlands vegetation monitoring dataset, using multiple time points instead of a single image yields clear accuracy improvements. On global remote-sensing benchmarks, the model performs competitively with larger state-of-the-art models while using fewer parameters and training data confined to one country.

What carries the argument

A hybrid CNN-ViT architecture that processes low- and high-frequency landscape features through convolutional and transformer layers while taking temporal image sequences as input to exploit dependencies across time.
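
To make the low- versus high-frequency distinction concrete, the following is a minimal sketch of splitting a single-band image into low-pass and high-pass components with a circular mask in the Fourier domain. It is an illustration only: the cutoff radius, the ideal circular mask, the synthetic tile, and the function name `split_frequencies` are assumptions, not the filters or data used in the paper.

```python
import numpy as np


def split_frequencies(image: np.ndarray, cutoff: float = 0.1):
    """Split a 2D image into low- and high-frequency parts.

    `cutoff` is a hypothetical radius in normalized frequency units
    (cycles per pixel); everything inside the radius counts as "low".
    """
    h, w = image.shape
    # Frequency coordinate of every FFT bin along each axis.
    fy = np.fft.fftfreq(h)[:, None]
    fx = np.fft.fftfreq(w)[None, :]
    low_mask = np.sqrt(fx**2 + fy**2) <= cutoff

    spectrum = np.fft.fft2(image)
    low = np.fft.ifft2(spectrum * low_mask).real
    high = np.fft.ifft2(spectrum * ~low_mask).real
    return low, high


# Synthetic stand-in for an image tile: a smooth gradient plus fine noise.
rng = np.random.default_rng(0)
tile = np.linspace(0.0, 1.0, 64)[None, :] * np.ones((64, 1))
tile = tile + 0.05 * rng.standard_normal((64, 64))

low, high = split_frequencies(tile, cutoff=0.1)
# The two masks are complementary, so the parts sum back to the input.
print(np.allclose(low + high, tile))
```

The low-pass part keeps the broad gradient (the analogue of large terrain structure), while the high-pass part keeps the fine noise (the analogue of textures and edges).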

If this is right

Incorporating temporal data produces clear performance gains on the Netherlands vegetation monitoring task compared with single-timepoint inputs.
The learned representations achieve competitive accuracy on global benchmarking datasets despite the model being smaller and pretrained only on Netherlands imagery.
Temporal constraints allow richer representations and better generalization when labeled data for downstream tasks is scarce.
Public release of the model weights and training scripts supports direct reuse and further experimentation.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

High-quality regional imagery paired with temporal context may prove sufficient for many global remote-sensing tasks, reducing the need for massive worldwide pretraining corpora.
Similar temporally-augmented training could be tested in other countries or sensor domains to create efficient localized foundation models.
Applications facing scarce labels might benefit from adopting multi-temporal inputs as a lightweight way to strengthen representations without increasing model scale.

Load-bearing premise

That supplying temporal sequences as input will reliably reduce feature ambiguity and improve generalization on downstream tasks even when labeled samples are limited.

What would settle it

A controlled ablation in which the temporal-input version shows no accuracy gain over an otherwise identical single-timepoint model on the vegetation monitoring task or fails to reach competitive scores on the global benchmarks.
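
One way such an ablation could be summarized is a paired bootstrap over per-sample correctness of the two variants. The sketch below uses only the Python standard library; the function name, the 0/1 correctness lists, and the toy numbers are placeholders for illustration and are not results from the paper.

```python
import random


def paired_bootstrap_gain(temporal_hits, single_hits, n_resamples=2000, seed=0):
    """Return (mean gain, 95% CI) of temporal over single-timepoint accuracy.

    Both arguments are 0/1 lists aligned on the same test samples.
    """
    assert len(temporal_hits) == len(single_hits)
    n = len(temporal_hits)
    diffs = [t - s for t, s in zip(temporal_hits, single_hits)]
    rng = random.Random(seed)
    means = []
    for _ in range(n_resamples):
        # Resample test items with replacement, keeping the pairing intact.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int(0.025 * n_resamples)]
    hi = means[int(0.975 * n_resamples) - 1]
    return sum(diffs) / n, (lo, hi)


# Placeholder outcomes for 10 test samples (NOT data from the paper).
temporal = [1, 1, 1, 0, 1, 1, 1, 1, 0, 1]
single = [1, 0, 1, 0, 1, 0, 1, 1, 0, 1]
gain, (ci_low, ci_high) = paired_bootstrap_gain(temporal, single)
print(f"gain={gain:.2f}, 95% CI=({ci_low:.2f}, {ci_high:.2f})")
```

An interval that excludes zero would suggest the gain is unlikely to be resampling noise; an interval spanning zero would be consistent with the "no accuracy gain" outcome described above.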

Figures

Figures reproduced from arXiv: 2605.10184 by Heysem Kaya, Paul Vermeeren.

**Figure 2.** The advantage of this hybrid structure is that CNNs ...

**Figure 2.** The composed architecture for pretraining the model, including a 4D patch embedding inspired by SpectralGPT [9], and four patch merging steps ...

**Figure 3.** Example of applying low- and high-pass filters to a pre-training sample, which is shown on the left. The right shows the output after applying the ...

**Figure 4.** A dataset sample from the vegetation monitoring dataset containing ...
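
The "four patch merging steps" named in the architecture caption can be illustrated with a Swin-style merge, in which each 2×2 neighborhood of tokens is concatenated and linearly projected, halving the spatial resolution and doubling the channel width. The sketch below is a NumPy illustration under stated assumptions: the 64×64 token grid, the 32 starting channels, and the random projection weights are placeholders, the normalization layer that Swin applies before the projection is omitted, and the paper's actual layer may differ.

```python
import numpy as np


def patch_merge(tokens: np.ndarray, weight: np.ndarray) -> np.ndarray:
    """Swin-style patch merging on a (H, W, C) token grid.

    Concatenates each 2x2 neighborhood into a 4C vector, then projects
    it to 2C channels with `weight` of shape (4C, 2C).
    """
    h, w, c = tokens.shape
    assert h % 2 == 0 and w % 2 == 0
    merged = np.concatenate(
        [
            tokens[0::2, 0::2],  # top-left token of each 2x2 block
            tokens[1::2, 0::2],  # bottom-left
            tokens[0::2, 1::2],  # top-right
            tokens[1::2, 1::2],  # bottom-right
        ],
        axis=-1,
    )  # shape (H/2, W/2, 4C)
    return merged @ weight


rng = np.random.default_rng(0)
x = rng.standard_normal((64, 64, 32))  # hypothetical token grid
for step in range(4):
    c = x.shape[-1]
    w_proj = rng.standard_normal((4 * c, 2 * c)) / np.sqrt(4 * c)
    x = patch_merge(x, w_proj)
    print(step + 1, x.shape)
```

After four merges the hypothetical 64×64×32 grid becomes 4×4×512, which shows how a hierarchical backbone trades spatial resolution for channel depth.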

Original abstract

We develop a foundation model using 1.2m high resolution satellite images of the Netherlands. By combining a Convolutional Neural Network and a Vision Transformer, the model captures both low- and high-frequency landscape features, such as fine textures, edges, and small objects as well as large terrain structures, elevation patterns, and land-cover distributions. Leveraging temporal data as input, the model learns from broader contextual information across time, allowing the model to exploit the temporal dependencies, such as topographic features, land-cover changes, and seasonal dynamics. These additional constraints reduce feature ambiguity, improve representation learning, and enable better generalization with fewer labeled samples. The foundation model is evaluated on multiple downstream tasks, ranging from use cases within the Netherlands to global benchmarking datasets. On the vegetation monitoring dataset of the Netherlands, the model shows clear performance improvements by incorporating temporal information instead of relying on a single time point. Despite using a smaller model and less pretraining data limited to the Netherlands, it achieves competitive results on global benchmarks when compared to state-of-the-art models. These results demonstrate that the model can learn rich, generalizable representations from limited data, achieving competitive performance on global benchmarks while using a fraction of the parameters of larger state-of-the-art remote sensing models. To maximize reproducibility and reuse, we made the scripts and the model accessible on GitHub.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper claims competitive global results from Dutch-only pretraining with temporal inputs, but the abstract supplies no numbers, baselines, or ablations to support the generalization story.

The main point is that the authors pretrain a CNN-ViT hybrid on 1.2 million high-resolution Dutch satellite images, feed in temporal channels to capture seasonal and change patterns, and then report that the resulting model matches or approaches larger global foundation models on downstream benchmarks despite using far less data and a smaller architecture. They also release the code and weights, which is straightforward and helpful for reuse. That combination of national-scale data plus temporal context is a reasonable practical move for anyone who needs representations tuned to temperate, flat agricultural landscapes rather than global diversity. The vegetation monitoring result inside the Netherlands is the one place where they explicitly tie temporal input to a performance lift, which at least gives a concrete anchor for the temporal claim.

The rest of the evaluation story is thinner. The abstract asserts competitive global numbers and better generalization with fewer labels, yet it contains no tables, no parameter counts for the compared models, no list of the benchmark datasets, no error bars, and no ablation that isolates temporal channels from single-frame inputs. Without those details it is impossible to judge how much domain shift actually occurs when Dutch-trained features are tested on global tasks or whether the reported gains survive proper statistical checks. The assumption that temporal sequences reliably cut feature ambiguity across limited labels is stated but not stress-tested in the provided text.

This work is aimed at remote-sensing groups that want efficient, regionally scoped models rather than the next universal foundation model. It is worth sending to peer review so the methods, exact numbers, and transfer experiments can be examined in full; the core setup is concrete enough to merit that step even if the claims will need tightening.

Referee Report

2 major / 1 minor

Summary. The manuscript presents a foundation model for high-resolution remote sensing imagery of the Netherlands, pretrained on 1.2 million images using a hybrid CNN-ViT architecture that processes temporal sequences. The model is claimed to capture both fine-grained textures and large-scale landscape structures, with temporal inputs reducing feature ambiguity and improving generalization on downstream tasks. Evaluations include vegetation monitoring within the Netherlands (showing gains from temporal data) and global benchmarking datasets, where competitive performance is reported despite a smaller model size and pretraining data restricted to the Netherlands. Scripts and the model are released on GitHub for reproducibility.

Significance. If substantiated with detailed quantitative evidence, the work would indicate that regionally focused pretraining can yield generalizable representations for remote sensing, potentially lowering barriers to foundation model development by demonstrating effective use of temporal context and smaller architectures. The open release of code and weights supports reproducibility and reuse in the field.

major comments (2)

[Abstract and results sections] The headline claim of achieving 'competitive results on global benchmarks' despite a smaller model and Netherlands-limited pretraining data is not accompanied by specific metrics (e.g., accuracy or mIoU scores), parameter counts of the compared SOTA models, or an enumerated list of the global datasets. This omission prevents assessment of domain-shift robustness from Dutch landscapes to diverse global domains.
[Evaluation on downstream tasks] No ablation studies, statistical tests, or error bars are provided to isolate the contribution of temporal sequence inputs versus single-frame inputs, nor to quantify the claimed reduction in feature ambiguity and improved generalization with fewer labeled samples. This weakens support for the central architectural choice.
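
For reference, the mIoU metric named in the first comment is conventionally computed per class from a confusion matrix and then averaged. A minimal standard-library sketch follows; the function name and the toy label lists are illustrative assumptions, not values from the manuscript.

```python
def mean_iou(y_true, y_pred, num_classes):
    """Mean intersection-over-union across classes present in the data."""
    # confusion[t][p] counts pixels with true class t predicted as p.
    confusion = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        confusion[t][p] += 1

    ious = []
    for k in range(num_classes):
        tp = confusion[k][k]
        fp = sum(confusion[t][k] for t in range(num_classes)) - tp
        fn = sum(confusion[k]) - tp
        union = tp + fp + fn
        if union > 0:  # skip classes absent from both labels and predictions
            ious.append(tp / union)
    return sum(ious) / len(ious)


# Toy flattened label maps (placeholders, not data from the paper).
truth = [0, 0, 1, 1, 2, 2]
preds = [0, 1, 1, 1, 2, 0]
print(mean_iou(truth, preds, num_classes=3))
```

Reporting the per-class IoU values alongside the mean would also make it easier to see which land-cover classes drive any gap between models.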

minor comments (1)

The description of the CNN-ViT hybrid would benefit from a diagram or explicit equations defining how low- and high-frequency features are fused across temporal inputs.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive feedback. We agree that additional quantitative details are needed to support our claims and have revised the manuscript accordingly to include specific metrics, model comparisons, dataset lists, and ablation analyses.

Referee [Abstract and results sections]: The headline claim of achieving 'competitive results on global benchmarks' despite a smaller model and Netherlands-limited pretraining data is not accompanied by specific metrics (e.g., accuracy or mIoU scores), parameter counts of the compared SOTA models, or an enumerated list of the global datasets. This omission prevents assessment of domain-shift robustness from Dutch landscapes to diverse global domains.

Authors: We agree that the abstract and results would benefit from explicit quantitative support. In the revised manuscript, we have added specific metrics (accuracy and mIoU scores) for the global benchmarks, parameter counts for the compared state-of-the-art models, and an enumerated list of the global datasets. These additions allow direct evaluation of domain-shift robustness from the Netherlands-only pretraining to diverse global domains.
Revision: yes

Referee [Evaluation on downstream tasks]: No ablation studies, statistical tests, or error bars are provided to isolate the contribution of temporal sequence inputs versus single-frame inputs, nor to quantify the claimed reduction in feature ambiguity and improved generalization with fewer labeled samples. This weakens support for the central architectural choice.

Authors: We acknowledge that stronger empirical isolation of the temporal component is warranted. In the revised manuscript, we have added ablation studies directly comparing temporal sequence inputs to single-frame inputs on the vegetation monitoring task. These include statistical tests and error bars to quantify the contribution to reduced feature ambiguity and improved generalization with fewer labeled samples.
Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical claims rest on external benchmarks and held-out tasks

The paper presents an empirical foundation model (CNN+ViT) pretrained on 1.2m Dutch satellite images and evaluated on downstream tasks including NL vegetation monitoring and global benchmarks. No equations, derivations, or self-citations are invoked that reduce reported performance metrics to quantities defined by the model's own fitted parameters or inputs by construction. The temporal-sequence benefit is asserted from observed improvements on held-out data rather than tautological redefinition, and the 'competitive with less data' claim is framed as a comparison against external SOTA results. The derivation chain is therefore self-contained against independent benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard deep-learning assumptions that CNN-ViT hybrids can extract useful multi-scale features from satellite imagery and that temporal stacking reduces ambiguity without introducing new biases. No new entities are postulated and no free parameters are explicitly fitted in the abstract.

axioms (2)

Domain assumption: Hybrid CNN-ViT architectures extract both local textures and global structures from remote-sensing imagery.
Invoked when the authors state the model captures low- and high-frequency landscape features.
Domain assumption: Temporal sequences provide additional constraints that reduce feature ambiguity and improve generalization.
Central justification for using time as input; appears in the abstract description of temporal dependencies.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "By combining a Convolutional Neural Network and a Vision Transformer, the model captures both low- and high-frequency landscape features... Leveraging temporal data as input..."

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "The architecture follows the Swin Transformer... four patch merging stages... CNN branch... frequency-domain training approach"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

38 extracted references · 38 canonical work pages

[1] R. Szeliski, Computer vision: algorithms and applications. Springer Nature, 2022
[2] A. Voulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: A brief review,” Computational intelligence and neuroscience, vol. 2018, no. 1, p. 7068349, 2018
[3] S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,” ACM computing surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022
[4] R. Bommasani, D. A. Hudson, E. Adeli et al., “On the opportunities and risks of foundation models,” arXiv preprint, 2022
[5] J. Jakubik, S. Roy, C. E. Phillips, P. Fraccaro, D. Godwin, B. Zadrozny, D. Szwarcman, C. Gomes, G. Nyirjesy, B. Edwards, D. Kimura, N. Simumba, L. Chu, S. K. Mukkavilli, D. Lambhate, K. Das, R. Bangalore, D. Oliveira, M. Muszynski, K. Ankur, M. Ramasubramanian, I. Gurung, S. Khallaghi, Hanxi, Li, M. Cecil, M. Ahmadi, F. Kordi, H. Alemohammad, M. Ma..., “Foundation models for generalist geospatial artificial intelligence,” 2023
[6] F. Bastani, P. Wolters, R. Gupta, J. Ferdinando, and A. Kembhavi, “Satlaspretrain: A large-scale dataset for remote sensing image understanding,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16772–16782
[7] M. Noman, M. Naseer, H. Cholakkal, R. M. Anwer, S. Khan, and F. S. Khan, “Rethinking transformers pre-training for multi-spectral satellite imagery,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27811–27819
[8] Z. Xiong, Y. Wang, F. Zhang, A. J. Stewart, J. Hanna, D. Borth, I. Papoutsis, B. L. Saux, G. Camps-Valls, and X. X. Zhu, “Neural plasticity-inspired foundation model for observing the earth crossing modalities,” arXiv preprint, 2024
[9] D. Hong, B. Zhang, X. Li, Y. Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, A. Plaza, P. Gamba, J. A. Benediktsson, and J. Chanussot, “Spectralgpt: Spectral remote sensing foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, pp. 5227–5244, Aug. 2024
[10] P. Akiva, M. Purri, and M. Leotta, “Self-supervised material and texture representation learning for remote sensing tasks,” in Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8203–8215
[11] K. Lang, J. Cui, M. Yang, H. Wang, Z. Wang, and H. Shen, “A convolution with transformer attention module integrating local and global features for object detection in remote sensing based on YOLOv8n,” Remote Sensing, vol. 16, no. 5, p. 906, 2024
[12] Y. Wang, T. Zhang, L. Zhao, L. Hu, Z. Wang, Z. Niu, P. Cheng, K. Chen, X. Zeng, Z. Wang, H. Wang, and X. Sun, “Ringmo-lite: A remote sensing lightweight network with cnn-transformer hybrid framework,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–20, 2024
[13] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019
[14] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017
[15] K. N. Clasen, L. Hackel, T. Burgert, G. Sumbul, B. Demir, and V. Markl, “reben: Refined bigearthnet dataset for remote sensing image analysis,” 2025
[16] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification.” New York, NY, USA: Association for Computing Machinery, 2010
[17] R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Urban change detection for multispectral earth observation using convolutional neural networks,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), July 2018
[18] T. Zhang, P. Gao, H. Dong, Y. Zhuang, G. Wang, W. Zhang, and H. Chen, “Consecutive pre-training: A knowledge transfer learning strategy with relevant unlabeled data for remote sensing domain,” Remote Sensing, vol. 14, no. 22, p. 5675, 2022
[19] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer International Publishing, 2018, pp. 432–448
[20] M. Wang, L. Su, C. Yan, S. Xu, H. Zhang, P. Yuan, X. Jiang, and B. Zhang, “Rsbuilding: Towards general remote sensing image building extraction and change detection with foundation model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–17, 2024
[21] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 12077–12090
[22] Y. Cong, S. Khanna et al., “Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 197–211, 2022
[23] H. Li, Y. Li, G. Zhang, R. Liu, H. Huang, Q. Zhu, and C. Tao, “Global and local contrastive self-supervised learning for semantic segmentation of hr remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022
[24] Y. Dai, F. Liu, W. Chen, Y. Liu, L. Shi, S. Liu, Y. Zhou et al., “Swin MAE: masked autoencoders for small datasets,” Computers in biology and medicine, vol. 161, p. 107037, 2023
[25] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10002
[26] ISPRS, “2D Semantic Labeling Potsdam,” 2018. [Online]. Available: www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx
[27] H. Chen, Z. Shi, Z. Zhang, and X. Liu, “Spatial-temporal attention neural network for building change detection in remote sensing images,” Remote Sensing, vol. 12, no. 10, p. 1667, 2020
[28] M. Mendieta, B. Han, X. Shi, Y. Zhu, and C. Chen, “Towards geospatial foundation models via continual pretraining,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 16760–16770
[29] B. V. Labs, “Aitlas-arena: Repository for remote sensing,” https://github.com/biasvariancelabs/aitlas-arena, 2024, accessed: 2024-10-08
[30] MMS Contributors, “MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark,” https://github.com/open-mmlab/mmsegmentation, 2020, accessed: 2024-10-08
[31] D. Muhtar, X. Zhang, P. Xiao, Z. Li, and F. Gu, “Cmid: A unified self-supervised learning framework for remote sensing image understanding,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–17, 2023
[32] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15979–15988
[33] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, “Convmae: Masked convolution meets masked autoencoders,” arXiv preprint, 2022
[34] C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4065–4076
[35] O. Mañas, A. Lacoste, X. Giró-i Nieto, D. Vazquez, and P. Rodríguez, “Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9394–9403
[36] S. Liu and Y. Zhao, “A unet-like hybrid transformer for efficient semantic segmentation of remote sensing images,” in 2023 5th International Conference on Geoscience and Remote Sensing Mapping (GRSM), 2023, pp. 149–154
[37] M. Tang, A. Cozma, K. Georgiou, and H. Qi, “Cross-scale mae: A tale of multiscale exploitation in remote sensing,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36, 2023, pp. 20054–20066
[38] K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for remote sensing images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1–17, 2024

[1] [1]

Szeliski,Computer vision: algorithms and applications

R. Szeliski,Computer vision: algorithms and applications. Springer Nature, 2022

work page 2022

[2] [2]

Deep learning for computer vision: A brief review,

A. V oulodimos, N. Doulamis, A. Doulamis, and E. Protopapadakis, “Deep learning for computer vision: A brief review,”Computational intelligence and neuroscience, vol. 2018, no. 1, p. 7068349, 2018

work page 2018

[3] [3]

Transformers in vision: A survey,

S. Khan, M. Naseer, M. Hayat, S. W. Zamir, F. S. Khan, and M. Shah, “Transformers in vision: A survey,”ACM computing surveys (CSUR), vol. 54, no. 10s, pp. 1–41, 2022

work page 2022

[4] [4]

On the opportunities and risks of foundation models,

R. Bommasani, D. A. Hudson, E. Adeliet al., “On the opportunities and risks of foundation models,”arXiv preprint, 2022

work page 2022

[5] [5]

Foundation models for generalist geospatial artificial intelligence,

J. Jakubik, S. Roy, C. E. Phillips, P. Fraccaro, D. Godwin, B. Zadrozny, D. Szwarcman, C. Gomes, G. Nyirjesy, B. Edwards, D. Kimura, N. Si- mumba, L. Chu, S. K. Mukkavilli, D. Lambhate, K. Das, R. Bangalore, 9 D. Oliveira, M. Muszynski, K. Ankur, M. Ramasubramanian, I. Gurung, S. Khallaghi, Hanxi, Li, M. Cecil, M. Ahmadi, F. Kordi, H. Alemoham- mad, M. Ma...

work page 2023

[6] [6]

Satlaspretrain: A large-scale dataset for remote sensing image under- standing,

F. Bastani, P. Wolters, R. Gupta, J. Ferdinando, and A. Kembhavi, “Satlaspretrain: A large-scale dataset for remote sensing image under- standing,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16 772–16 782

work page 2023

[7] [7]

Rethinking transformers pre-training for multi-spectral satellite imagery,

M. Noman, M. Naseer, H. Cholakkal, R. M. Anwer, S. Khan, and F. S. Khan, “Rethinking transformers pre-training for multi-spectral satellite imagery,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024, pp. 27 811–27 819

work page 2024

[8] [8]

Neural plasticity-inspired foundation model for observing the earth crossing modalities,

Z. Xiong, Y . Wang, F. Zhang, A. J. Stewart, J. Hanna, D. Borth, I. Papoutsis, B. L. Saux, G. Camps-Valls, and X. X. Zhu, “Neural plasticity-inspired foundation model for observing the earth crossing modalities,”arXiv preprint, 2024

work page 2024

[9] [9]

Spectralgpt: Spectral remote sensing foundation model,

D. Hong, B. Zhang, X. Li, Y . Li, C. Li, J. Yao, N. Yokoya, H. Li, P. Ghamisi, X. Jia, A. Plaza, P. Gamba, J. A. Benediktsson, and J. Chanussot, “Spectralgpt: Spectral remote sensing foundation model,” IEEE Transactions on Pattern Analysis and Machine Intelligence, vol. 46, no. 8, p. 5227–5244, Aug. 2024

work page 2024

[10] [10]

Self-supervised material and texture representation learning for remote sensing tasks,

P. Akiva, M. Purri, and M. Leotta, “Self-supervised material and texture representation learning for remote sensing tasks,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 8203–8215

work page 2022

[11] K. Lang, J. Cui, M. Yang, H. Wang, Z. Wang, and H. Shen, “A convolution with transformer attention module integrating local and global features for object detection in remote sensing based on YOLOv8n,” Remote Sensing, vol. 16, no. 5, p. 906, 2024

[12] Y. Wang, T. Zhang, L. Zhao, L. Hu, Z. Wang, Z. Niu, P. Cheng, K. Chen, X. Zeng, Z. Wang, H. Wang, and X. Sun, “Ringmo-lite: A remote sensing lightweight network with cnn-transformer hybrid framework,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–20, 2024

[13] P. Helber, B. Bischke, A. Dengel, and D. Borth, “Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 2019

[14] G. Cheng, J. Han, and X. Lu, “Remote sensing image scene classification: Benchmark and state of the art,” Proceedings of the IEEE, vol. 105, no. 10, pp. 1865–1883, 2017

[15] K. N. Clasen, L. Hackel, T. Burgert, G. Sumbul, B. Demir, and V. Markl, “reben: Refined bigearthnet dataset for remote sensing image analysis,” 2025

[16] Y. Yang and S. Newsam, “Bag-of-visual-words and spatial extensions for land-use classification.” New York, NY, USA: Association for Computing Machinery, 2010

[17] R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau, “Urban change detection for multispectral earth observation using convolutional neural networks,” in IEEE International Geoscience and Remote Sensing Symposium (IGARSS), July 2018

[18] T. Zhang, P. Gao, H. Dong, Y. Zhuang, G. Wang, W. Zhang, and H. Chen, “Consecutive pre-training: A knowledge transfer learning strategy with relevant unlabeled data for remote sensing domain,” Remote Sensing, vol. 14, no. 22, p. 5675, 2022

[19] T. Xiao, Y. Liu, B. Zhou, Y. Jiang, and J. Sun, “Unified perceptual parsing for scene understanding,” in Proceedings of the European Conference on Computer Vision (ECCV). Springer International Publishing, 2018, pp. 432–448

[20] M. Wang, L. Su, C. Yan, S. Xu, H. Zhang, P. Yuan, X. Jiang, and B. Zhang, “Rsbuilding: Towards general remote sensing image building extraction and change detection with foundation model,” IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–17, 2024

[21] E. Xie, W. Wang, Z. Yu, A. Anandkumar, J. M. Alvarez, and P. Luo, “Segformer: Simple and efficient design for semantic segmentation with transformers,” in Advances in Neural Information Processing Systems (NeurIPS), vol. 34, 2021, pp. 12 077–12 090

[22] Y. Cong, S. Khanna et al., “Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery,” Advances in Neural Information Processing Systems (NeurIPS), vol. 35, pp. 197–211, 2022

[23] H. Li, Y. Li, G. Zhang, R. Liu, H. Huang, Q. Zhu, and C. Tao, “Global and local contrastive self-supervised learning for semantic segmentation of hr remote sensing images,” IEEE Transactions on Geoscience and Remote Sensing, vol. 60, pp. 1–14, 2022

[24] Y. Dai, F. Liu, W. Chen, Y. Liu, L. Shi, S. Liu, Y. Zhou et al., “Swin MAE: masked autoencoders for small datasets,” Computers in biology and medicine, vol. 161, p. 107037, 2023

[25] Z. Liu, Y. Lin, Y. Cao, H. Hu, Y. Wei, Z. Zhang, S. Lin, and B. Guo, “Swin transformer: Hierarchical vision transformer using shifted windows,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pp. 9992–10 002

[26] ISPRS, “2D Semantic Labeling Potsdam,” 2018. [Online]. Available: www.isprs.org/resources/datasets/benchmarks/UrbanSemLab/2d-sem-label-potsdam.aspx

[27] H. Chen, Z. Shi, Z. Zhang, and X. Liu, “Spatial-temporal attention neural network for building change detection in remote sensing images,” Remote Sensing, vol. 12, no. 10, p. 1667, 2020

[28] M. Mendieta, B. Han, X. Shi, Y. Zhu, and C. Chen, “Towards geospatial foundation models via continual pretraining,” in Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 16 760–16 770

[29] B. V. Labs, “Aitlas-arena: Repository for remote sensing,” https://github.com/biasvariancelabs/aitlas-arena, 2024, accessed: 2024-10-08

[30] MMS Contributors, “MMSegmentation: OpenMMLab Semantic Segmentation Toolbox and Benchmark,” https://github.com/open-mmlab/mmsegmentation, 2020, accessed: 2024-10-08

[31] D. Muhtar, X. Zhang, P. Xiao, Z. Li, and F. Gu, “Cmid: A unified self-supervised learning framework for remote sensing image understanding,” IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–17, 2023

[32] K. He, X. Chen, S. Xie, Y. Li, P. Dollár, and R. Girshick, “Masked autoencoders are scalable vision learners,” in 2022 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022, pp. 15 979–15 988

[33] P. Gao, T. Ma, H. Li, Z. Lin, J. Dai, and Y. Qiao, “Convmae: Masked convolution meets masked autoencoders,” arXiv preprint, 2022

[34] C. J. Reed, R. Gupta, S. Li, S. Brockman, C. Funk, B. Clipp, K. Keutzer, S. Candido, M. Uyttendaele, and T. Darrell, “Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning,” in 2023 IEEE/CVF International Conference on Computer Vision (ICCV), 2023, pp. 4065–4076

[35] O. Mañas, A. Lacoste, X. Giró-i Nieto, D. Vazquez, and P. Rodríguez, “Seasonal contrast: Unsupervised pre-training from uncurated remote sensing data,” in 2021 IEEE/CVF International Conference on Computer Vision (ICCV), 2021, pp. 9394–9403

[36] S. Liu and Y. Zhao, “A unet-like hybrid transformer for efficient semantic segmentation of remote sensing images,” in 2023 5th International Conference on Geoscience and Remote Sensing Mapping (GRSM), 2023, pp. 149–154

[37] M. Tang, A. Cozma, K. Georgiou, and H. Qi, “Cross-scale mae: A tale of multiscale exploitation in remote sensing,” in Advances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine, Eds., vol. 36, 2023, pp. 20 054–20 066

[38] K. Cha, J. Seo, and T. Lee, “A billion-scale foundation model for remote sensing images,” IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, pp. 1–17, 2024