Rethinking Electro-Optical Vision Foundation Models for Remote Sensing Retrieval: A Controlled Comparison with Generalist VFM

Dong-Geol Choi; Hyobin Park; Minseok Seo

arxiv: 2605.02283 · v1 · submitted 2026-05-04 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-09 15:55 UTC · model grok-4.3

keywords: remote sensing · vision foundation models · image retrieval · cross-scene generalization · electro-optical imagery · generalist models · domain-specific pretraining

The pith

Strong generalist vision foundation models compete with or outperform electro-optical specific models in remote sensing image retrieval while generalizing more stably across scenes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper runs a controlled comparison to check whether vision foundation models pretrained specifically on electro-optical remote sensing imagery deliver better representations for image retrieval than generalist models trained on ordinary photographs. It applies identical datasets, retrieval protocols, and metrics to both types of models and measures accuracy both inside the original scenes and when the models must handle entirely new ones. The results indicate that capable generalist models perform at least as well as the specialized ones and suffer far less performance loss when scenes change, which undercuts the idea that remote-sensing-only pretraining is required for strong retrieval results. This observation matters because collecting and labeling remote sensing data is costly and expert-intensive, so evidence that general models can substitute reduces the need for repeated domain-specific training.

Core claim

In a controlled evaluation using the same remote sensing datasets and retrieval protocol, representative generalist vision foundation models match or exceed the performance of electro-optical specific models for in-domain retrieval and exhibit substantially less degradation when evaluated on new scenes, indicating that specialized pretraining on remote sensing imagery does not inherently produce superior retrieval-oriented representations.

What carries the argument

Controlled side-by-side evaluation of electro-optical specific versus generalist vision foundation models under identical in-domain and cross-scene remote sensing image retrieval protocols.
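As a rough illustration of what an identical retrieval protocol can look like, the sketch below ranks gallery embeddings by cosine similarity to a query and scores the top-k results by label agreement. The function names, the choice of cosine similarity, and the toy embeddings are assumptions made for illustration; this page does not state the paper's exact protocol. The only point is that the same functions are applied unchanged to features from any backbone, EO-specific or generalist.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (returned unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def retrieve_top_k(query, gallery, k):
    """Return indices of the k gallery embeddings closest to the query.

    Similarity is cosine similarity; ties are broken by gallery index.
    """
    q = l2_normalize(query)
    scored = []
    for idx, item in enumerate(gallery):
        g = l2_normalize(item)
        scored.append((sum(a * b for a, b in zip(q, g)), idx))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scored[:k]]


def precision_at_k(retrieved, gallery_labels, query_label):
    """Fraction of retrieved items whose label matches the query label."""
    hits = sum(1 for idx in retrieved if gallery_labels[idx] == query_label)
    return hits / len(retrieved)


# Toy data (hypothetical): 2-D "embeddings" standing in for backbone features.
gallery = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
gallery_labels = ["forest", "forest", "urban", "urban"]
top = retrieve_top_k([1.0, 0.05], gallery, k=2)
print(top, precision_at_k(top, gallery_labels, "forest"))
```

In a controlled comparison of this kind, only the function that produces the embeddings would change between models; the ranking and scoring code stays fixed.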

If this is right

Generalist models provide a practical alternative for remote sensing retrieval without requiring separate domain-specific pretraining.
Electro-optical specific models experience substantial degradation when tested on new scenes and therefore need improvements for reliable cross-scene use.
Pretraining on remote sensing data alone does not guarantee stronger retrieval representations than generalist pretraining.
Future electro-optical vision foundation models should incorporate physical, spatial, spectral, and geographic characteristics of the imagery more effectively.
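One simple way to quantify the "degradation" mentioned above is the fraction of the in-domain score that is lost under cross-scene evaluation. The sketch below is a minimal version of that calculation; the function name, the model names, and all scores are invented placeholders, not results from the paper.

```python
def relative_drop(in_domain_score, cross_scene_score):
    """Fraction of the in-domain score lost when moving to new scenes."""
    if in_domain_score <= 0:
        raise ValueError("in_domain_score must be positive")
    return (in_domain_score - cross_scene_score) / in_domain_score


# Hypothetical (in-domain, cross-scene) scores in percent. NOT from the paper.
placeholder_scores = {
    "eo_specific_model": (80.0, 40.0),
    "generalist_model": (80.0, 72.0),
}

for name, (in_dom, cross) in placeholder_scores.items():
    print(f"{name}: relative drop = {relative_drop(in_dom, cross):.2f}")
```

Reporting the relative drop next to the absolute scores separates two questions that are easy to conflate: how good a model is in-domain, and how much of that quality survives a scene shift.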

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The observed stability of generalist models may result from their exposure to far larger and more diverse training corpora that capture common visual structures across domains.
Controlled comparisons of the same kind could be extended to other remote sensing tasks such as segmentation or detection to determine whether the generalization advantage persists.
Hybrid pretraining that combines generalist data with targeted remote sensing examples might combine the strengths of both approaches.

Load-bearing premise

The selected representative electro-optical specific and generalist models, together with the chosen datasets and retrieval protocol, constitute a fair, unbiased, and generalizable comparison of the two modeling paradigms.

What would settle it

A replication that includes additional electro-optical specific models or different remote sensing datasets in which the specialized models show consistent superiority in cross-scene retrieval accuracy would undermine the central observation.

Figures

Figures reproduced from arXiv: 2605.02283 by Dong-Geol Choi, Hyobin Park, Minseok Seo.

**Figure 1.** Overview of our controlled comparison between EO-specific and generalist vision foundation models for remote sensing image...

Original abstract

Vision foundation models have attracted significant attention for their ability to leverage large-scale unlabeled visual data. This advantage is particularly important in remote sensing, where data acquisition is costly and annotation often requires expert knowledge. Recent electro-optical vision foundation models aim to learn domain-specific representations from remote sensing imagery, but it remains unclear whether they are more effective than strong generalist vision foundation models under retrieval-based evaluation. In this study, we conduct a controlled comparison between representative EO-specific and generalist vision foundation models for remote sensing image retrieval. Using the same datasets, retrieval protocol, and evaluation metric, we evaluate both in-domain performance and cross-scene generalization. Our results show that strong generalist vision foundation models are competitive with, and in some cases outperform, existing EO-specific models. Moreover, EO-specific models often suffer from substantial degradation under cross-scene evaluation, while generalist models show more stable transfer. These findings suggest that EO pretraining alone does not guarantee stronger retrieval-oriented remote sensing representations. We discuss the limitations of current EO-specific pretraining strategies and highlight the need for future EO vision foundation models to better exploit the physical, spatial, spectral, and geographic characteristics of remote sensing imagery.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Generalist VFMs match or beat the tested EO-specific models on retrieval and hold up better across scenes, but the EO model selection limits how much this indicts domain-specific pretraining overall.

The paper's core finding is that strong generalist vision foundation models perform competitively with existing EO-specific ones on remote sensing retrieval and degrade less when scenes shift. The authors run the same datasets, protocol, and metric on both sides, which keeps the comparison interpretable and isolates the effect of pretraining type. The cross-scene numbers are the clearest part of the contribution; they show generalists transferring more stably, which matters for real remote sensing use where new locations are common. The discussion correctly flags that current EO pretraining has not yet made strong use of physical, spectral, or geographic structure. That observation is useful even if the numbers themselves are not dramatic.

The main soft spot is the choice of EO-specific models. The claim that EO pretraining does not guarantee better representations rests on the tested instances being representative of the paradigm. If those models are narrower or less optimized than the strongest current EO approaches, the degradation could reflect implementation gaps rather than an inherent limit. The abstract's wording around 'existing' models leaves this open, and without seeing the exact selection criteria or ablations it is hard to judge how far the result generalizes.

This work is aimed at remote sensing researchers who need to decide whether to invest in custom EO pretraining or start from a generalist backbone for retrieval tasks. A reader focused on practical cross-scene performance will get direct evidence to weigh. The empirical framing is clean enough to merit a serious referee, though the authors should strengthen the justification for their EO baselines and add statistical checks on the differences.

Referee Report

2 major / 3 minor

Summary. The paper conducts a controlled empirical comparison of representative electro-optical (EO) vision foundation models against strong generalist vision foundation models on remote sensing image retrieval tasks. Using identical datasets, retrieval protocols, and evaluation metrics, it measures both in-domain performance and cross-scene generalization. Results indicate that generalist models are competitive with or outperform EO-specific models, with the latter exhibiting greater performance degradation under cross-scene shifts; the authors conclude that EO pretraining alone does not guarantee superior retrieval-oriented representations and call for future models to better incorporate physical, spatial, spectral, and geographic cues.

Significance. If the comparison is representative, the findings carry substantial implications for remote sensing vision research by questioning the default emphasis on domain-specific pretraining and suggesting that generalist models may offer more robust and efficient starting points. The controlled experimental design (shared data, protocol, and metric) is a clear strength that enables direct attribution of differences to model paradigms rather than confounding factors. This could usefully redirect community efforts toward improved adaptation techniques or hybrid pretraining strategies that explicitly leverage remote-sensing physics.

major comments (2)

§4.1 (Model Selection and Representativeness): The central claim that 'existing EO-specific models' suffer from substantial cross-scene degradation and that EO pretraining does not guarantee stronger representations rests on the assumption that the selected EO models fairly instantiate the paradigm. The manuscript should explicitly justify the choice of these particular models (architectures, pretraining objectives, data scales) versus omitted recent or larger-scale EO VFMs, and discuss whether observed gaps could stem from suboptimal implementations rather than inherent limitations of domain-specific pretraining. This is load-bearing for generalizing from the evaluated instances to the broader paradigm.
§4.3 (Cross-Scene Evaluation Protocol): While the shared protocol is a strength, the paper does not report statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) on the performance deltas between generalist and EO models. Without these, it is difficult to assess whether the reported stability advantage of generalists is robust or could be explained by variance across the limited scene splits.

minor comments (3)

§3 (Related Work): The discussion of prior EO VFMs could more explicitly contrast their pretraining objectives (e.g., contrastive vs. reconstruction) with those of the generalist models to clarify what 'EO-specific' means operationally.
Figure 2 and Table 2: Axis labels and captions should include the exact retrieval metric (e.g., mAP@K) and the number of query/gallery images per split to improve reproducibility.
§5 (Discussion): The limitations paragraph on current EO pretraining strategies is useful but could be expanded with concrete suggestions (e.g., incorporation of metadata or spectral bands) rather than remaining at a high level.
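To make the request for an exact metric concrete, here is a minimal sketch of mAP@K, the example the referee names. It is not necessarily the paper's metric: the Table 1 caption extracted on this page reports F1-score (%). Definitions of AP@K also vary; the version below normalizes by min(K, number of relevant gallery items), which is one common convention and an assumption here, as are the function names.

```python
def average_precision_at_k(ranked_relevance, k, num_relevant):
    """AP@K for one query.

    ranked_relevance: 0/1 flags for the ranked gallery results, best first.
    num_relevant: total number of relevant items in the gallery.
    """
    if k <= 0:
        raise ValueError("k must be positive")
    if num_relevant == 0:
        return 0.0
    hits = 0
    precision_sum = 0.0
    for rank, rel in enumerate(ranked_relevance[:k], start=1):
        if rel:
            hits += 1
            precision_sum += hits / rank  # precision at this rank
    return precision_sum / min(k, num_relevant)


def mean_average_precision_at_k(queries, k):
    """mAP@K over a list of (ranked_relevance, num_relevant) pairs."""
    if not queries:
        raise ValueError("queries must not be empty")
    total = sum(average_precision_at_k(r, k, n) for r, n in queries)
    return total / len(queries)


# Toy rankings (hypothetical): 1 marks a relevant result, 0 an irrelevant one.
toy_queries = [([1, 0, 1, 0], 2), ([0, 1], 1)]
print(mean_average_precision_at_k(toy_queries, k=4))
```

Because conventions differ in exactly this normalization step, stating K and the denominator in the caption is what makes reported numbers comparable across papers.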

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment point by point below, indicating the revisions we will incorporate.

Referee: §4.1 (Model Selection and Representativeness): The central claim that 'existing EO-specific models' suffer from substantial cross-scene degradation and that EO pretraining does not guarantee stronger representations rests on the assumption that the selected EO models fairly instantiate the paradigm. The manuscript should explicitly justify the choice of these particular models (architectures, pretraining objectives, data scales) versus omitted recent or larger-scale EO VFMs, and discuss whether observed gaps could stem from suboptimal implementations rather than inherent limitations of domain-specific pretraining. This is load-bearing for generalizing from the evaluated instances to the broader paradigm.

Authors: We chose the EO-specific models as representative instances drawn from prominent recent works that cover diverse architectures (e.g., ViT-based), pretraining objectives (contrastive and reconstruction-based), and data scales typical of the EO VFM literature. In the revised manuscript we will expand §4.1 with an explicit justification subsection, including a comparison of the selected models' parameter counts, pretraining corpus sizes, and objectives against other recent or larger-scale EO VFMs. We will also acknowledge that performance gaps could partly reflect implementation details while noting that the controlled experimental design (identical data, protocol, and metrics) still allows attribution of differences to the domain-specific pretraining paradigm itself.
revision: yes

Referee: §4.3 (Cross-Scene Evaluation Protocol): While the shared protocol is a strength, the paper does not report statistical significance tests (e.g., paired t-tests or bootstrap confidence intervals) on the performance deltas between generalist and EO models. Without these, it is difficult to assess whether the reported stability advantage of generalists is robust or could be explained by variance across the limited scene splits.

Authors: We agree that quantifying statistical robustness would strengthen the cross-scene analysis. In the revised version we will add bootstrap confidence intervals (1,000 resamples) on the mean performance deltas between generalist and EO models across the scene splits, reported alongside the existing metrics. This addition requires only post-hoc computation on the already-collected results and will directly address concerns about variance in the limited splits.
revision: yes
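The bootstrap check promised in this response could look roughly like the sketch below: a paired percentile bootstrap over scene splits, with 1,000 resamples as stated. Everything else is an assumption for illustration, including the function name, the use of the percentile method, the treatment of each scene split as the resampling unit, and the placeholder scores, which are invented and not taken from the paper.

```python
import random


def bootstrap_ci_mean_delta(generalist, eo, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean paired delta (generalist - EO).

    generalist, eo: per-split scores in the same split order.
    Returns (observed_mean_delta, ci_low, ci_high).
    """
    if len(generalist) != len(eo) or not generalist:
        raise ValueError("need two non-empty score lists of equal length")
    deltas = [g - e for g, e in zip(generalist, eo)]
    n = len(deltas)
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    means = []
    for _ in range(n_resamples):
        # Resample splits with replacement, keeping each pair intact.
        sample = [deltas[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo_idx = max(0, int((alpha / 2) * n_resamples))
    hi_idx = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples) - 1)
    return sum(deltas) / n, means[lo_idx], means[hi_idx]


# Hypothetical per-split cross-scene scores in percent. NOT from the paper.
generalist_scores = [70.0, 66.0, 72.0, 69.0, 71.0]
eo_scores = [65.0, 60.0, 68.0, 62.0, 66.0]
print(bootstrap_ci_mean_delta(generalist_scores, eo_scores))
```

An interval that excludes zero would support the claimed stability advantage. With only a handful of scene splits the interval will be coarse, which is the variance concern the referee raises.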

Circularity Check

0 steps flagged

No circularity: purely empirical comparison with external benchmarks

The paper conducts a controlled empirical evaluation of existing vision foundation models on public remote sensing retrieval datasets using standard protocols and metrics. No mathematical derivations, equations, fitted parameters, or predictions are claimed; all results are direct measurements on held-out data. The central claims rest on observed performance differences rather than any self-referential construction, self-citation chain, or ansatz. The study is self-contained against external benchmarks and contains no load-bearing internal definitions that reduce to the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on standard machine-learning evaluation assumptions rather than new free parameters or invented entities. No numerical constants are fitted to produce the headline result.

axioms (2)

domain assumption: The chosen EO-specific and generalist models are representative of their respective categories.
The abstract refers to 'representative' models without detailing selection criteria or exhaustive coverage of the model landscape.
domain assumption: The datasets and retrieval protocol fairly represent the remote sensing image retrieval task.
Standard assumption in empirical computer-vision studies; the abstract does not discuss potential dataset biases or protocol limitations.

Reference graph

Works this paper leans on

27 extracted references · 7 canonical work pages · 3 internal anchors

[1] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. Beit: Bert pre-training of image transformers. In International Conference on Learning Representations, 2022.
[2] Mathieu Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9650–9660, 2021.
[3] Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 9640–9649, 2021.
Table 1. F1-score (%) comparison of self-supervised models for RS-CBIR on BEN-14K, FMoW-RGB, and FMoW-Sentinel datasets under in-domain and cros...
[4] Shabnam Choudhury, Yash Salunkhe, Sarthak Mehrotra, and Biplab Banerjee. Rejepa: A novel joint-embedding predictive architecture for efficient remote sensing image retrieval. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 2373–2382, 2025.
[5] Gordon Christie, Neil Fendley, James Wilson, and Ryan Mukherjee. Functional map of the world. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6172–6180, 2018.
[6] Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for temporal and multi-spectral satellite imagery. Advances in Neural Information Processing Systems, 35:197–211, 2022.
[7] Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.
[8] Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. In International Conference on Learning Representations, 2021.
[9] Jakob Hackstein, Gencer Sumbul, Kai Norman Clasen, and Begüm Demir. Exploring masked autoencoders for sensor-agnostic image retrieval in remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 63:1–14, 2024.
[10] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 16000–16009, 2022.
[11] Johannes Jakubik, Sujit Roy, C. E. Phillips, Paolo Fraccaro, Denys Godwin, Bianca Zadrozny, Daniela Szwarcman, Carlos Gomes, Gabby Nyirjesy, Blair Edwards, et al. Foundation models for generalist geospatial artificial intelligence. arXiv preprint arXiv:2310.18660, 2023.
[12] Gukyeong Kwon, Zhaowei Cai, Avinash Ravichandran, Erhan Bas, Rahul Bhotika, and Stefano Soatto. Masked vision and language modeling for multi-modal representation learning. arXiv preprint arXiv:2208.02131, 2022.
[13] Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Remoteclip: A vision language foundation model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 62:1–16, 2024.
[14] Matias Mendieta, Boran Han, Xingjian Shi, Yi Zhu, and Chen Chen. Towards geospatial foundation models via continual pretraining. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 16806–16816, 2023.
[15] Mubashir Noman, Muzammal Naseer, Hisham Cholakkal, Rao Muhammad Anwer, Salman Khan, and Fahad Shahbaz Khan. Rethinking transformers pre-training for multi-spectral satellite imagery. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27811–27819, 2024.
[16] Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. Dinov2: Learning robust visual features without supervision. arXiv preprint arXiv:2304.07193, 2023.
[17] Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International Conference on Machine Learning, pages 8748–8763. PMLR, 2021.
[18] Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brockman, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088–4099, 2023.
[19] Minseok Seo, Mark Hamilton, and Changick Kim. Upsample anything: A simple and hard to beat baseline for feature upsampling. arXiv preprint arXiv:2511.16301, 2025.
[20] Minseok Seo, Wonjun Lee, Jaehyuk Jang, and Changick Kim. Efficient test-time optimization for depth completion via low-rank decoder adaptation. arXiv preprint arXiv:2603.01765, 2026.
[21] Oriane Siméoni, Huy V Vo, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Michaël Ramamonjisoa, et al. Dinov3. arXiv preprint arXiv:2508.10104, 2025.
[22] Arnold WM Smeulders, Marcel Worring, Simona Santini, Amarnath Gupta, and Ramesh Jain. Content-based image retrieval at the end of the early years. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(12):1349–1380, 2000.
[23] Gencer Sumbul, Marcela Charfuelan, Begüm Demir, and Volker Markl. Bigearthnet: A large-scale benchmark archive for remote sensing image understanding. pages 5901–5904,
[24] Xian Sun, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiaonan Lu, Qibin He, Junxi Li, Xuee Rong, Zhujun Yang, Hao Chang, et al. Ringmo: A remote sensing foundation model with masked image modeling. IEEE Transactions on Geoscience and Remote Sensing, 61:1–22, 2022.
[25] Di Wang, Jing Zhang, Bo Du, Gui-Song Xia, and Dacheng Tao. An empirical study of remote sensing pretraining. IEEE Transactions on Geoscience and Remote Sensing, 61:1–20, 2023.
[26] Di Wang, Qiming Zhang, Yufei Xu, Jing Zhang, Bo Du, Dacheng Tao, and Liangpei Zhang. Advancing plain vision transformer toward remote sensing foundation model. IEEE Transactions on Geoscience and Remote Sensing, 61:1–15,
[27] Fuzhen Zhuang, Zhiyuan Qi, Keyu Duan, Dongbo Xi, Yongchun Zhu, Hengshu Zhu, Hui Xiong, and Qing He. A comprehensive study of transfer learning. Proceedings of the IEEE, 109(1):43–76, 2020.
