SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

Puhong Duan; Qiwei Ma; Shutao Li; Wang Liu; Xudong Kang; Xukun Lu

arxiv: 2510.22665 · v3 · pith:E6YIS3XWnew · submitted 2025-10-26 · 💻 cs.CV · cs.AI

Qiwei Ma, Xukun Lu, Wang Liu, Puhong Duan, Xudong Kang, Shutao Li

Pith reviewed 2026-05-21 20:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI

keywords SAR imagery · vision-language model · foundation model · domain transfer · image-text retrieval · semantic understanding · remote sensing · SARVLM-1M

The pith

SARVLM is the first vision-language foundation model for SAR imagery, built with a million-scale dataset and optical remote sensing data as a bridge to transfer knowledge from natural images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create an effective vision-language model for SAR images, which have been held back by scarce paired text data and large visual differences from natural scenes. It assembles SARVLM-1M with over one million image-text pairs and introduces a two-stage training process that routes learning through optical remote sensing images. This matters because SAR supplies reliable all-weather observation for monitoring, defense, and disaster response, and stronger cross-modal understanding would let systems retrieve, detect, and describe SAR content without task-specific retraining. If the approach works, it shows that indirect domain transfer can close the gap between everyday images and radar returns.

Core claim

SARVLM, consisting of SARCLIP and SARCap, is developed on the SARVLM-1M dataset of more than one million image-text pairs through a two-stage domain transfer training strategy that uses optical remote sensing data as an intermediate bridge to move knowledge from natural images into the SAR domain; an ensemble strategy further improves cross-scene generalization. The resulting model, together with SARDet and SARRot extensions, produces stronger feature extraction and interpretation than prior vision-language models across thirteen benchmarks covering image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning.

What carries the argument

Two-stage domain transfer training strategy that treats optical remote sensing imagery as an intermediate bridge to adapt natural-image knowledge to SAR.
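
A minimal sketch of the kind of objective each stage could optimise, assuming a CLIP-style symmetric InfoNCE loss (a passage quoted further down this page describes SARCLIP's contrastive loss as InfoNCE-based). The function names, the temperature of 0.07, and the toy embeddings are illustrative assumptions, not the authors' implementation. Under this reading, stage one would apply the loss to optical remote sensing image-text pairs and stage two to SAR pairs, starting from the stage-one weights.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit length."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over one list of logits."""
    peak = max(row)
    log_sum = peak + math.log(sum(math.exp(x - peak) for x in row))
    return [x - log_sum for x in row]


def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """CLIP-style loss: pair k of images and texts is the positive,
    every other pairing in the batch is a negative.
    temperature=0.07 is an assumed value, not taken from the paper."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # Cosine similarity between every image and every text, scaled by 1/temperature.
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    # Image-to-text direction: softmax over each row.
    i2t = -sum(log_softmax(logits[k])[k] for k in range(n)) / n
    # Text-to-image direction: softmax over each column.
    columns = [[logits[r][c] for r in range(n)] for c in range(n)]
    t2i = -sum(log_softmax(columns[k])[k] for k in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy check: matched pairs should score a much lower loss than swapped pairs.
aligned = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
swapped = symmetric_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < swapped)
```

When every embedding in the batch is identical the loss equals log(n), which is the chance level for a batch of n pairs; that makes a convenient sanity check before training.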

If this is right

SARVLM improves accuracy on image-text retrieval, zero-shot classification, and image captioning for SAR scenes.
The same framework yields stronger object detection results when instantiated as SARDet and SARRot.
The ensemble component increases robustness across different imaging scenes and conditions.
Semantic localization and target recognition tasks become more reliable without task-specific fine-tuning.
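
Image-text retrieval accuracy of the kind mentioned in the first item is commonly summarised as Recall@K. The sketch below is an illustration, not the paper's evaluation code. It assumes a square similarity matrix in which row i holds the scores of query i against every candidate, and in which the correct match for query i is candidate i. The name `recall_at_k` and the toy matrix are hypothetical.

```python
def recall_at_k(similarity, k):
    """Fraction of queries whose true match (same index) ranks in the top k."""
    hits = 0
    for query_index, row in enumerate(similarity):
        # Candidate indices sorted from most to least similar.
        ranked = sorted(range(len(row)), key=lambda j: row[j], reverse=True)
        if query_index in ranked[:k]:
            hits += 1
    return hits / len(similarity)


# Toy similarity matrix (made-up scores): queries 0 and 2 rank their match first,
# query 1 ranks its match second.
toy_similarity = [
    [0.9, 0.1, 0.0],
    [0.2, 0.3, 0.8],
    [0.1, 0.2, 0.7],
]
print(recall_at_k(toy_similarity, 1), recall_at_k(toy_similarity, 2))
```

The same function covers text-to-image retrieval if the matrix is transposed first.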

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The optical-bridge technique could be tested on other radar-like modalities such as sonar or ground-penetrating radar where paired text data is also scarce.
Operational pipelines that fuse live SAR streams with text queries might become feasible once the model is quantized for edge hardware.
The dataset-construction method offers a template for building multimodal corpora in other remote-sensing domains that lack direct text labels.

Load-bearing premise

Optical remote sensing data can serve as an effective intermediate bridge to transfer knowledge from natural images to the SAR domain despite the substantial differences between SAR and natural imagery.

What would settle it

Retraining the same architecture on SARVLM-1M without the optical remote sensing bridge stage and measuring whether the performance advantage over existing vision-language models on the thirteen benchmarks disappears or reverses.
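
The comparison described above can be written down as a small bookkeeping routine. This is a sketch under stated assumptions: higher scores are better on every benchmark, the three score dictionaries share the same keys, and the benchmark names and numbers are placeholders invented for illustration, not results from the paper. The function name `bridge_verdict` and the tolerance `tol` are likewise hypothetical.

```python
def bridge_verdict(full, no_bridge, baseline, tol=0.005):
    """For each benchmark, report how much the bridge stage adds and what
    happens to the advantage over the baseline once the bridge is removed."""
    report = {}
    for name in full:
        bridge_gain = full[name] - no_bridge[name]
        advantage_without_bridge = no_bridge[name] - baseline[name]
        if advantage_without_bridge > tol:
            verdict = "advantage persists"
        elif advantage_without_bridge < -tol:
            verdict = "advantage reverses"
        else:
            verdict = "advantage disappears"
        report[name] = (verdict, bridge_gain)
    return report


# Placeholder scores only; none of these numbers come from the paper.
full_pipeline = {"bench_a": 0.80, "bench_b": 0.70, "bench_c": 0.60}
without_bridge = {"bench_a": 0.78, "bench_b": 0.65, "bench_c": 0.50}
prior_vlm = {"bench_a": 0.70, "bench_b": 0.65, "bench_c": 0.55}

for name, (verdict, gain) in bridge_verdict(full_pipeline, without_bridge, prior_vlm).items():
    print(name, verdict, round(gain, 3))
```

If most benchmarks came back as "advantage persists" with a small bridge gain, dataset scale rather than the optical bridge would be the more likely source of the improvement.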

Figures

Figures reproduced from arXiv: 2510.22665 by Puhong Duan, Qiwei Ma, Shutao Li, Wang Liu, Xudong Kang, Xukun Lu.

**Figure 2.** The paradigm for foundation model: (a) CL-based methods, (b) MIM-based methods, (c) CLIP-based methods. [PITH_FULL_IMAGE:figures/full_fig_p002_2.png]

**Figure 3.** (a) Examples from the SARVLM-1M dataset; (b) Two-stage domain transfer training strategy for SARCLIP. [PITH_FULL_IMAGE:figures/full_fig_p003_3.png]

**Figure 4.** The framework of SARCap method. [PITH_FULL_IMAGE:figures/full_fig_p004_4.png]

From §3.1 (Problem Definition): In this section, we investigate the paradigm of learning joint representations from SAR images and their corresponding textual descriptions. Specifically, we construct the SARVLM-1M dataset $\mathcal{D} = \{(I_i, T_i)\}_{i=1}^{M}$ consisting of SAR images $I_i \in \mathbb{R}^{H \times W}$ with the corresponding descriptions $T_i \in \mathcal{T}$.

**Figure 6.** Feature space visualization of SARCLIP‡ image encoder on three downstream datasets (ViT-L-14). [PITH_FULL_IMAGE:figures/full_fig_p007_6.png]

**Figure 7.** Visualization of results on semantic localization task. [PITH_FULL_IMAGE:figures/full_fig_p008_7.png]

**Figure 9.** Ablation study on training layers of SARCLIP. [PITH_FULL_IMAGE:figures/full_fig_p008_9.png]

Original abstract

Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multi-modal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to SAR domains. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. In addition, an ensemble strategy is utilized to improve the cross-scene generalization capability of the model. Moreover, SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released on https://github.com/KlayMa527/SARVLM.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

SARVLM is the first vision-language model for SAR with a 1M-pair dataset and two-stage optical bridge transfer, but the performance edge over prior work hinges on unablated claims about that bridge.

The main takeaway is that this paper delivers the first dedicated vision-language foundation model for SAR imagery. The authors aggregate existing sources into SARVLM-1M, a corpus of over a million image-text pairs, and train through a two-stage process that routes knowledge from natural images via optical remote sensing data before reaching SAR. SARCLIP and SARCap form the core, with an ensemble added for cross-scene robustness and the SARDet and SARRot extensions covering object detection.

Referee Report

2 major / 2 minor

Summary. The paper introduces SARVLM, the first vision-language foundation model for SAR imagery. It constructs the SARVLM-1M dataset comprising over one million image-text pairs and proposes a two-stage domain transfer strategy that uses optical remote sensing data as an intermediate bridge to transfer knowledge from natural images to SAR. The model consists of SARCLIP and SARCap components, incorporates an ensemble strategy for cross-scene generalization, and includes SARDet and SARRot for object detection validation. Extensive experiments claim consistent outperformance over state-of-the-art vision-language models across 13 benchmarks spanning image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning.

Significance. If the performance claims are supported by rigorous ablations, statistical tests, and clear baseline comparisons, this work would represent a meaningful advance in multimodal semantic understanding for SAR imagery, addressing the scarcity of SAR vision-language data and the domain gap with natural images. The construction and planned release of the SARVLM-1M dataset would provide a valuable resource for the community. The two-stage transfer approach, if validated, could offer a practical template for domain adaptation in remote sensing modalities.

major comments (2)

[Abstract and §3, two-stage domain transfer] The central claim that the two-stage strategy (natural images → optical remote sensing → SAR) mitigates substantial SAR-natural differences and drives the reported gains is load-bearing for explaining outperformance over SOTA VLMs, yet no ablation studies isolate its contribution versus single-stage direct adaptation, dataset scale alone, or architecture choices in SARCLIP/SARCap. This leaves the effectiveness of the optical RS bridge as an untested assumption.
[§4, experiments and associated tables] The assertion of consistent outperformance on 13 benchmarks lacks reported details on exact baselines, data splits, statistical significance testing, or variance across runs. Without these, it is impossible to assess whether gains are robust or influenced by post-hoc choices, undermining the cross-task superiority claim.

minor comments (2)

[§3] The notation for SARCLIP and SARCap components could be clarified with explicit architectural diagrams or equations showing how they differ from standard CLIP and captioning heads.
[§4] Ensure all benchmark results include the number of runs or error bars to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and outline the revisions we will make to improve the rigor and transparency of the manuscript.

read point-by-point responses

Referee: [Abstract and §3, two-stage domain transfer] The central claim that the two-stage strategy (natural images → optical remote sensing → SAR) mitigates substantial SAR-natural differences and drives the reported gains is load-bearing for explaining outperformance over SOTA VLMs, yet no ablation studies isolate its contribution versus single-stage direct adaptation, dataset scale alone, or architecture choices in SARCLIP/SARCap. This leaves the effectiveness of the optical RS bridge as an untested assumption.

Authors: We agree that dedicated ablations are required to substantiate the contribution of the two-stage domain transfer. The current manuscript motivates the optical remote sensing bridge based on the domain gap but does not isolate its effect. In the revised manuscript we will add ablation experiments in §3 and §4 that compare (i) the full two-stage pipeline against direct single-stage adaptation from natural images to SAR, (ii) performance at different dataset scales, and (iii) variations in SARCLIP/SARCap architecture while keeping the transfer strategy fixed. These results will be presented with quantitative deltas to clarify the incremental benefit of the intermediate optical RS stage.
Revision: yes

Referee: [§4, experiments and associated tables] The assertion of consistent outperformance on 13 benchmarks lacks reported details on exact baselines, data splits, statistical significance testing, or variance across runs. Without these, it is impossible to assess whether gains are robust or influenced by post-hoc choices, undermining the cross-task superiority claim.

Authors: We acknowledge that the experimental section requires additional detail for reproducibility and statistical rigor. The revised §4 and tables will specify: (a) exact baseline models with citations and implementation details, (b) the precise train/validation/test splits used for each of the 13 benchmarks, (c) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) comparing SARVLM against baselines, and (d) mean and standard deviation of key metrics across at least three random seeds. These additions will allow readers to evaluate the robustness of the reported improvements.
Revision: yes
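
To make items (c) and (d) concrete, here is a minimal sketch of a per-seed summary and a paired t statistic using only the Python standard library. It assumes the model and the baseline were each evaluated once per seed on the same benchmark, so that scores can be paired seed by seed. The function names and the three-seed scores are placeholders invented for illustration, not numbers from the paper. Converting the statistic to a p-value needs the t distribution, which the standard library lacks; a statistics package such as SciPy (`scipy.stats.ttest_rel`) would be one option.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(model_scores, baseline_scores):
    """Paired t statistic and degrees of freedom for per-seed score pairs."""
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_value = statistics.mean(diffs) / (sd / math.sqrt(n))
    return t_value, n - 1


# Placeholder per-seed scores (three seeds); not results from the paper.
model = [0.80, 0.82, 0.84]
baseline = [0.75, 0.78, 0.77]

print(summarize(model))
print(paired_t_statistic(model, baseline))
```

With only three seeds the test has two degrees of freedom, so even a large t value gives weak evidence; more seeds would tighten the comparison.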

Circularity Check

0 steps flagged

No circularity: empirical benchmarks validate proposed two-stage transfer

The paper constructs the SARVLM-1M dataset and proposes a two-stage training strategy that uses optical remote sensing data as an intermediate bridge for knowledge transfer from natural images to SAR. It then trains SARCLIP and SARCap components and reports outperformance on 13 empirical benchmarks for retrieval, classification, detection, and captioning. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs or to self-citations. The central claims rest on experimental results rather than self-definitional loops, renamed known results, or load-bearing self-citations. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the effectiveness of the two-stage domain transfer and the representativeness of the aggregated SARVLM-1M dataset; many training hyperparameters and exact aggregation rules are not specified in the abstract.

free parameters (1)

Two-stage domain transfer hyperparameters
Learning rates, epochs, and loss weights for the intermediate optical bridge stage are not detailed but must be chosen or fitted to achieve the reported transfer.

axioms (1)

[domain assumption] Optical remote sensing imagery shares sufficient structural properties with both natural images and SAR to act as an effective knowledge-transfer bridge.
Invoked directly in the description of the two-stage training strategy to mitigate differences between SAR and natural imagery.

invented entities (3)

SARVLM-1M (no independent evidence)
purpose: Large-scale vision-language training dataset for SAR
Aggregated from existing datasets; no independent validation of quality or coverage is provided in the abstract.
SARCLIP (no independent evidence)
purpose: SAR-adapted image-text alignment model
Core component of the proposed SARVLM framework.
SARCap (no independent evidence)
purpose: SAR image captioning module
Core component of the proposed SARVLM framework.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge... SARCLIP... contrastive loss... InfoNCE-based

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: SARVLM-1M... 1.7 million image-text pairs... templated text synthesis

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model
eess.IV · 2026-05 · unverdicted · novelty 6.0

Introduces the SMART-HC-VQA dataset with 65k single-image and 2.3M temporal VQA examples plus an adapted LLaVA-NeXT MLLM framework for geospatial-temporal sensemaking of remote sensing construction activity.

Reference graph

Works this paper leans on

51 extracted references · 51 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1]

The air force moving and stationary target recog- nition database.https://www.sdms.afrl.af.mil/ index.php?collection=mstar, 1995

AFR Lab. The air force moving and stationary target recog- nition database.https://www.sdms.afrl.af.mil/ index.php?collection=mstar, 1995. 4, 5

work page 1995
[2]

Omnisat: Self-supervised modality fusion for earth observation

Guillaume Astruc, Nicolas Gonthier, Clement Mallet, and Loic Landrieu. Omnisat: Self-supervised modality fusion for earth observation. InEuropean Conference on Computer Vision, pages 409–427. Springer, 2024. 3

work page 2024
[3]

Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments

Satanjeev Banerjee and Alon Lavie. Meteor: An automatic metric for mt evaluation with improved correlation with hu- man judgments. InProceedings of the acl workshop on in- trinsic and extrinsic evaluation measures for machine trans- lation and/or summarization, pages 65–72, 2005. 6

work page 2005
[4]

Tar- get classification using the deep convolutional networks for sar images.IEEE Transactions on Geoscience and Remote Sensing, 54(8):4806–4817, 2016

Sizhe Chen, Haipeng Wang, Feng Xu, and Ya-Qiu Jin. Tar- get classification using the deep convolutional networks for sar images.IEEE Transactions on Geoscience and Remote Sensing, 54(8):4806–4817, 2016. 5

work page 2016
[5]

A simple framework for contrastive learn- ing of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Ge- offrey Hinton. A simple framework for contrastive learn- ing of visual representations. InInternational Conference on Machine Learning, pages 1597–1607. PMLR, 2020. 1

work page 2020
[6]

Reproducible scal- ing laws for contrastive language-image learning

Mehdi Cherti, Romain Beaumont, Ross Wightman, Mitchell Wortsman, Gabriel Ilharco, Cade Gordon, Christoph Schuh- mann, Ludwig Schmidt, and Jenia Jitsev. Reproducible scal- ing laws for contrastive language-image learning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2818–2829, 2023. 6

work page 2023
[7]

Rouge: A package for automatic evaluation of summaries

Lin Chin-Yew. Rouge: A package for automatic evaluation of summaries. InProceedings of the Workshop on Text Sum- marization Branches Out, 2004, 2004. 6

work page 2004
[8]

Satmae: Pre-training transformers for tem- poral and multi-spectral satellite imagery.Advances in Neu- ral Information Processing Systems, 35:197–211, 2022

Yezhen Cong, Samar Khanna, Chenlin Meng, Patrick Liu, Erik Rozi, Yutong He, Marshall Burke, David Lobell, and Stefano Ermon. Satmae: Pre-training transformers for tem- poral and multi-spectral satellite imagery.Advances in Neu- ral Information Processing Systems, 35:197–211, 2022. 3

work page 2022
[9]

Hyperspectral and sar image classification via graph convolutional fusion network.IEEE Transactions on Geoscience and Remote Sensing, 2024

Bin Deng, Puhong Duan, Xukun Lu, Zihao Wang, and Xudong Kang. Hyperspectral and sar image classification via graph convolutional fusion network.IEEE Transactions on Geoscience and Remote Sensing, 2024. 1

work page 2024
[10]

Rethinking remote sensing clip: Lever- aging multimodal large language models for high-quality vision-language dataset

Yiguo He, Junjie Zhu, Yiying Li, Qiangjuan Huang, Zhiyuan Wang, and Ke Yang. Rethinking remote sensing clip: Lever- aging multimodal large language models for high-quality vision-language dataset. InInternational Conference on Neural Information Processing, pages 417–431. Springer,

work page
[11]

Enhancing remote sensing vision-language models through mllm and llm-based high-quality image-text dataset generation.arXiv preprint arXiv:2507.16716, 2025

Yiguo He, Junjie Zhu, Yiying Li, Xiaoyu Zhang, Chunping Qiu, Jun Wang, Qiangjuan Huang, and Ke Yang. Enhancing remote sensing vision-language models through mllm and llm-based high-quality image-text dataset generation.arXiv preprint arXiv:2507.16716, 2025. 6

work page arXiv 2025
[12]

Fusar-ship: Building a high-resolution sar-ais matchup dataset of gaofen-3 for ship detection and recog- nition.Science China Information Sciences, 63(4):140303,

Xiyue Hou, Wei Ao, Qian Song, Jian Lai, Haipeng Wang, and Feng Xu. Fusar-ship: Building a high-resolution sar-ais matchup dataset of gaofen-3 for ship detection and recog- nition.Science China Information Sciences, 63(4):140303,

work page
[13]

Pallavi Jain, Bianca Schoen-Phelan, and Robert Ross. Self- supervised learning for invariant representations from multi- spectral and sar images.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 15:7797– 7808, 2022. 2

work page 2022
[14]

Sfr-net: Scattering feature relation network for aircraft detection in complex sar images.IEEE Transactions on Geo- science and Remote Sensing, 60:1–17, 2021

Yuzhuo Kang, Zhirui Wang, Jiamei Fu, Xian Sun, and Kun Fu. Sfr-net: Scattering feature relation network for aircraft detection in complex sar images.IEEE Transactions on Geo- science and Remote Sensing, 60:1–17, 2021. 1

work page 2021
[15]

Geochat: Grounded large vision-language model for remote sensing

Kartik Kuckreja, Muhammad Sohail Danish, Muzammal Naseer, Abhijit Das, Salman Khan, and Fahad Shahbaz Khan. Geochat: Grounded large vision-language model for remote sensing. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 27831– 27840, 2024. 3

work page 2024
[16]

Synthetic sar image generation using sensor, terrain and target models

Anders Kusk, Adili Abulaitijiang, and Jorgen Dall. Synthetic sar image generation using sensor, terrain and target models. InProceedings of EUSAR 2016: 11th European Conference on Synthetic Aperture Radar, pages 1–5. VDE, 2016. 4

work page 2016
[17]

A sar dataset for atr development: the synthetic and measured paired labeled experiment (sample)

Benjamin Lewis, Theresa Scarnati, Elizabeth Sudkamp, John Nehrbass, Stephen Rosencrantz, and Edmund Zelnio. A sar dataset for atr development: the synthetic and measured paired labeled experiment (sample). InAlgorithms for Syn- thetic Aperture Radar Imagery XXVI, pages 39–54. SPIE,

work page
[18]

Opensarship 2.0: A large-volume dataset for deeper interpretation of ship targets in sentinel-1 imagery

Boying Li, Bin Liu, Lanqing Huang, Weiwei Guo, Zenghui Zhang, and Wenxian Yu. Opensarship 2.0: A large-volume dataset for deeper interpretation of ship targets in sentinel-1 imagery. In2017 SAR in Big Data Era: Models, Methods and Applications (BIGSARDATA), pages 1–5. IEEE, 2017. 4

work page 2017
[19]

Weijie Li, Wei Yang, Tianpeng Liu, Yuenan Hou, Yuxuan Li, Zhen Liu, Yongxiang Liu, and Li Liu. Predicting gradient is better: Exploring self-supervised learning for sar atr with a joint-embedding predictive architecture.ISPRS Journal of Photogrammetry and Remote Sensing, 218:326–338, 2024. 3

work page 2024
[20]

Saratr-x: Towards building a foundation model for sar target recognition.IEEE Transactions on Im- age Processing, 2025

Weijie Li, Wei Yang, Yuenan Hou, Li Liu, Yongxiang Liu, and Xiang Li. Saratr-x: Towards building a foundation model for sar target recognition.IEEE Transactions on Im- age Processing, 2025. 2, 3, 5

work page 2025
[21]

Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object de- tection.arXiv preprint arXiv:2403.06534, 2024

Yuxuan Li, Xiang Li, Weijie Li, Qibin Hou, Li Liu, Ming- Ming Cheng, and Jian Yang. Sardet-100k: Towards open- source benchmark and toolkit for large-scale sar object de- tection.arXiv preprint arXiv:2403.06534, 2024. 2, 3, 4

work page arXiv 2024
[22]

Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024

Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. Re- moteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing, 2024. 2, 3, 6

work page 2024
[23]

Visual instruction tuning.Advances in Neural Information Processing Systems, 36:34892–34916, 2023

Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning.Advances in Neural Information Processing Systems, 36:34892–34916, 2023. 3

work page 2023
[24]

Learning from noisy pseudo- labels for all-weather land cover mapping.arXiv preprint arXiv:2504.13458, 2025

Wang Liu, Zhiyu Wang, Xin Guo, Puhong Duan, Xudong Kang, and Shutao Li. Learning from noisy pseudo- labels for all-weather land cover mapping.arXiv preprint arXiv:2504.13458, 2025. 1

work page arXiv 2025
[25]

Atrnet-star: A large dataset and bench- mark towards remote sensing object recognition in the wild,

Yongxiang Liu, Weijie Li, Li Liu, Jie Zhou, Bowen Peng, Yafei Song, Xuying Xiong, Wei Yang, Tianpeng Liu, Zhen Liu, and Xiang Li. Atrnet-star: A large dataset and bench- mark towards remote sensing object recognition in the wild,

work page
[26]

Exploring models and data for remote sensing im- age caption generation.IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017

Xiaoqiang Lu, Binqiang Wang, Xiangtao Zheng, and Xue- long Li. Exploring models and data for remote sensing im- age caption generation.IEEE Transactions on Geoscience and Remote Sensing, 56(4):2183–2195, 2017. 5

work page 2017
[27]

Sarchat-bench-2m: A multi-task vision-language benchmark for sar image inter- pretation.arXiv preprint arXiv:2502.08168, 2025

Zhiming Ma, Xiayang Xiao, Sihao Dong, Peidong Wang, HaiPeng Wang, and Qingyun Pan. Sarchat-bench-2m: A multi-task vision-language benchmark for sar image inter- pretation.arXiv preprint arXiv:2502.08168, 2025. 2, 3

work page arXiv 2025
[28]

Visualizing data using t-sne.Journal of Machine Learning Research, 9 (Nov):2579–2605, 2008

Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne.Journal of Machine Learning Research, 9 (Nov):2579–2605, 2008. 8

work page 2008
[29]

Re- mote sensing vision-language foundation models without annotations via ground remote alignment.arXiv preprint arXiv:2312.06960, 2023

Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl V ondrick, Bharath Hariharan, and Kavita Bala. Re- mote sensing vision-language foundation models without annotations via ground remote alignment.arXiv preprint arXiv:2312.06960, 2023. 3

work page arXiv 2023
[30]

Improving sar automatic target recognition models with transfer learning from simulated data.IEEE Geoscience and Remote Sensing Letters, 14(9):1484–1488, 2017

David Malmgren-Hansen, Anders Kusk, Jørgen Dall, Al- lan Aasbjerg Nielsen, Rasmus Engholm, and Henning Skriver. Improving sar automatic target recognition models with transfer learning from simulated data.IEEE Geoscience and Remote Sensing Letters, 14(9):1484–1488, 2017. 4

work page 2017
[31]

Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model

Dilxat Muhtar, Zhenshi Li, Feng Gu, Xueliang Zhang, and Pengfeng Xiao. Lhrs-bot: Empowering remote sensing with vgi-enhanced large multimodal language model. In European Conference on Computer Vision, pages 440–457. Springer, 2024. 3

work page 2024
[32]

Bleu: a method for automatic evaluation of machine translation

Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic evaluation of machine translation. InProceedings of the 40th annual meeting of the Association for Computational Linguistics, pages 311–318,

work page
[33]

Learn- ing transferable visual models from natural language super- vision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learn- ing transferable visual models from natural language super- vision. InInternational Conference on Machine Learning, pages 8748–8763. PMLR, 2021. 1, 2, 3

work page 2021
[34]

Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning

Colorado J Reed, Ritwik Gupta, Shufan Li, Sarah Brock- man, Christopher Funk, Brian Clipp, Kurt Keutzer, Salvatore Candido, Matt Uyttendaele, and Trevor Darrell. Scale-mae: A scale-aware masked autoencoder for multiscale geospatial representation learning. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 4088– 4099, 2023. 2, 3

work page 2023
[35]

Xian Sun, Yixuan Lv, Zhirui Wang, and Kun Fu. Scan: Scat- tering characteristics analysis network for few-shot aircraft classification in high-resolution sar images.IEEE Transac- tions on Geoscience and Remote Sensing, 60:1–17, 2022. 5

work page 2022
[36]

Ringmo: A remote sensing foundation model with masked image modeling.IEEE Transactions on Geo- science and Remote Sensing, 61:1–22, 2022

Xian Sun, Peijin Wang, Wanxuan Lu, Zicong Zhu, Xiao- nan Lu, Qibin He, Junxi Li, Xuee Rong, Zhujun Yang, Hao Chang, et al. Ringmo: A remote sensing foundation model with masked image modeling.IEEE Transactions on Geo- science and Remote Sensing, 61:1–22, 2022. 3

work page 2022
[37]

Cross-scale mae: A tale of multiscale exploita- tion in remote sensing.Advances in Neural Information Pro- cessing Systems, 36:20054–20066, 2023

Maofeng Tang, Andrei Cozma, Konstantinos Georgiou, and Hairong Qi. Cross-scale mae: A tale of multiscale exploita- tion in remote sensing.Advances in Neural Information Pro- cessing Systems, 36:20054–20066, 2023. 2, 3

work page 2023
[38]

Cider: Consensus-based image description evalua- tion

Ramakrishna Vedantam, C Lawrence Zitnick, and Devi Parikh. Cider: Consensus-based image description evalua- tion. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 4566–4575, 2015. 6

work page 2015
[39]

Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021

Junjue Wang, Zhuo Zheng, Ailong Ma, Xiaoyan Lu, and Yanfei Zhong. Loveda: A remote sensing land-cover dataset for domain adaptive semantic segmentation.arXiv preprint arXiv:2110.08733, 2021. 5

work page arXiv 2021
[40]

Skyscript: A large and seman- tically diverse vision-language dataset for remote sensing

Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal. Skyscript: A large and seman- tically diverse vision-language dataset for remote sensing. InProceedings of the AAAI Conference on Artificial Intel- ligence, pages 5805–5813, 2024. 2, 3, 6

[41]

Yimin Wei, Aoran Xiao, Yexian Ren, Yuting Zhu, Hongruixuan Chen, Junshi Xia, and Naoto Yokoya. Sarlang-1m: A benchmark for vision-language modeling in sar image understanding. arXiv preprint arXiv:2504.03254, 2025.

[42]

Mitchell Wortsman, Gabriel Ilharco, Jong Wook Kim, Mike Li, Simon Kornblith, Rebecca Roelofs, Raphael Gontijo Lopes, Hannaneh Hajishirzi, Ali Farhadi, Hongseok Namkoong, et al. Robust fine-tuning of zero-shot models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7959–7971, 2022.

[43]

Youming Wu, Yuxi Suo, Qingbiao Meng, Wei Dai, Tiao Miao, Wenchao Zhao, Zhiyuan Yan, Wenhui Diao, Guocun Xie, Qingyang Ke, et al. Fair-csar: A benchmark dataset for fine-grained object detection and recognition based on single look complex sar images. IEEE Transactions on Geoscience and Remote Sensing, 2024.

[44]

Gui-Song Xia, Xiang Bai, Jian Ding, Zhen Zhu, Serge Belongie, Jiebo Luo, Mihai Datcu, Marcello Pelillo, and Liangpei Zhang. Dota: A large-scale dataset for object detection in aerial images. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3974–3983, 2018.

[45]

Jiahui Yu, Zirui Wang, Vijay Vasudevan, Legg Yeung, Mojtaba Seyedhosseini, and Yonghui Wu. Coca: Contrastive captioners are image-text foundation models. arXiv preprint arXiv:2205.01917, 2022.

[46]

Miao Yu, Heqiang Yuan, Jialiang Chen, Chongyang Hao, Zhe Wang, Zhiqiang Yuan, and Bin Lu. Selo v2: Toward for higher and faster semantic localization. IEEE Geoscience and Remote Sensing Letters, 20:1–5, 2023.

[47]

Zhiqiang Yuan, Wenkai Zhang, Chongyang Li, Zhaoying Pan, Yongqiang Mao, Jialiang Chen, Shuoke Li, Hongqi Wang, and Xian Sun. Learning to evaluate performance of multimodal semantic localization. IEEE Transactions on Geoscience and Remote Sensing, 60:1–18, 2022.

[48]

Tianwen Zhang, Xiaoling Zhang, Jianwei Li, Xiaowo Xu, Baoyou Wang, Xu Zhan, Yanqin Xu, Xiao Ke, Tianjiao Zeng, Hao Su, et al. Sar ship detection dataset (ssdd): Official release and comprehensive data analysis. Remote Sensing, 13(18):3690, 2021.

[49]

Wei Zhang, Miaoxin Cai, Tong Zhang, Yin Zhuang, and Xuerui Mao. Earthgpt: A universal multi-modal large language model for multi-sensor image comprehension in remote sensing domain. IEEE Transactions on Geoscience and Remote Sensing, 2024.

[50]

Xin Zhang, Xue Yang, Yuxuan Li, Jian Yang, Ming-Ming Cheng, and Xiang Li. Rsar: Restricted state angle resolver and rotated sar benchmark. arXiv preprint arXiv:2501.04440, 2025.

[51]

Zilun Zhang, Tiancheng Zhao, Yulong Guo, and Jianwei Yin. Rs5m and georsclip: A large scale vision-language dataset and a large vision-language model for remote sensing. IEEE Transactions on Geoscience and Remote Sensing, 2024.

