
arxiv: 2510.22665 · v3 · pith:E6YIS3XWnew · submitted 2025-10-26 · 💻 cs.CV · cs.AI

SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery

Pith reviewed 2026-05-21 20:54 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords SAR imagery · vision-language model · foundation model · domain transfer · image-text retrieval · semantic understanding · remote sensing · SARVLM-1M

The pith

SARVLM is the first vision-language foundation model for SAR imagery, built with a million-scale dataset and optical remote sensing data as a bridge to transfer knowledge from natural images.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to create an effective vision-language model for SAR images, which have been held back by scarce paired text data and large visual differences from natural scenes. It assembles SARVLM-1M with over one million image-text pairs and introduces a two-stage training process that routes learning through optical remote sensing images. This matters because SAR supplies reliable all-weather observation for monitoring, defense, and disaster response, and stronger cross-modal understanding would let systems retrieve, detect, and describe SAR content without task-specific retraining. If the approach works, it shows that indirect domain transfer can close the gap between everyday images and radar returns.

Core claim

SARVLM, consisting of SARCLIP and SARCap, is developed on the SARVLM-1M dataset of more than one million image-text pairs through a two-stage domain transfer training strategy that uses optical remote sensing data as an intermediate bridge to move knowledge from natural images into the SAR domain; an ensemble strategy further improves cross-scene generalization. The resulting model, together with SARDet and SARRot extensions, produces stronger feature extraction and interpretation than prior vision-language models across thirteen benchmarks covering image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning.
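The core claim mentions an ensemble strategy without saying what form it takes. One common option for CLIP-style models is to interpolate the weights of two checkpoints. The sketch below illustrates that idea only as an assumption, not as the paper's method; the function name `interpolate_weights`, the parameter `alpha`, and the toy checkpoints are all hypothetical.

```python
def interpolate_weights(weights_a, weights_b, alpha):
    """Blend two checkpoints per parameter: (1 - alpha) * A + alpha * B."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    if weights_a.keys() != weights_b.keys():
        raise ValueError("checkpoints must share parameter names")
    blended = {}
    for name in weights_a:
        a, b = weights_a[name], weights_b[name]
        if len(a) != len(b):
            raise ValueError(f"length mismatch for parameter {name!r}")
        blended[name] = [(1.0 - alpha) * x + alpha * y for x, y in zip(a, b)]
    return blended


# Toy checkpoints with made-up values, standing in for real model weights.
general_ckpt = {"proj": [0.0, 2.0]}
sar_tuned_ckpt = {"proj": [2.0, 4.0]}
blended = interpolate_weights(general_ckpt, sar_tuned_ckpt, alpha=0.5)
```

With `alpha=0` the blend equals the first checkpoint and with `alpha=1` the second, so a sweep over `alpha` would show how much cross-scene robustness trades against in-domain accuracy under this assumed scheme.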

What carries the argument

Two-stage domain transfer training strategy that treats optical remote sensing imagery as an intermediate bridge to adapt natural-image knowledge to SAR.
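A minimal sketch of how such a staged schedule could be organized, assuming SARCLIP is trained with the symmetric contrastive loss used by CLIP-style models. The stage names, the dataset labels in `STAGES`, and the temperature value are illustrative assumptions; the page does not give the paper's actual losses or hyperparameters.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def diagonal_cross_entropy(logits):
    """Mean cross-entropy where the correct column for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        peak = max(row)  # subtract the max for numerical stability
        log_sum_exp = peak + math.log(sum(math.exp(x - peak) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def clip_contrastive_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric image-to-text and text-to-image contrastive loss."""
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    logits = [[sum(a * b for a, b in zip(i, t)) / temperature for t in txts]
              for i in imgs]
    logits_t = [list(col) for col in zip(*logits)]
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits_t))


# Hypothetical schedule: each stage starts from the previous stage's weights.
STAGES = [
    {"name": "stage1_optical_rs", "init_from": "natural_image_clip",
     "data": "optical remote sensing image-text pairs"},
    {"name": "stage2_sar", "init_from": "stage1_optical_rs",
     "data": "SARVLM-1M"},
]

# Toy embeddings: matched pairs should score a lower loss than mismatched ones.
aligned = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
mismatched = clip_contrastive_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

The same loss is reused in both stages here; only the data and the starting weights change, which is the sense in which the optical stage acts as a bridge.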

If this is right

  • SARVLM improves accuracy on image-text retrieval, zero-shot classification, and image captioning for SAR scenes.
  • The same framework yields stronger object detection results when instantiated as SARDet and SARRot.
  • The ensemble component increases robustness across different imaging scenes and conditions.
  • Semantic localization and target recognition tasks become more reliable without task-specific fine-tuning.
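For readers unfamiliar with how a CLIP-style model classifies without task-specific training, the sketch below shows the usual mechanism: embed one text prompt per class, embed the image, and pick the class whose prompt is most similar. The toy vectors stand in for encoder outputs, and the class names and `zero_shot_classify` helper are hypothetical, not taken from the paper.

```python
import math


def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0
    return dot / (norm_u * norm_v)


def zero_shot_classify(image_emb, class_text_embs):
    """Return the best-matching label and the per-class similarity scores."""
    scores = {label: cosine_similarity(image_emb, emb)
              for label, emb in class_text_embs.items()}
    best = max(scores, key=scores.get)
    return best, scores


# Made-up 3-d embeddings standing in for text-encoder outputs of prompts
# such as "a SAR image of a ship".
class_text_embs = {
    "ship": [1.0, 0.0, 0.0],
    "bridge": [0.0, 1.0, 0.0],
    "aircraft": [0.0, 0.0, 1.0],
}
image_emb = [0.9, 0.1, 0.0]  # made-up image-encoder output
label, scores = zero_shot_classify(image_emb, class_text_embs)
```

Image-text retrieval uses the same similarity scores, ranked across a gallery instead of reduced to a single best label.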

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The optical-bridge technique could be tested on other radar-like modalities such as sonar or ground-penetrating radar where paired text data is also scarce.
  • Operational pipelines that fuse live SAR streams with text queries might become feasible once the model is quantized for edge hardware.
  • The dataset-construction method offers a template for building multimodal corpora in other remote-sensing domains that lack direct text labels.

Load-bearing premise

Optical remote sensing data can serve as an effective intermediate bridge to transfer knowledge from natural images to the SAR domain despite the substantial differences between SAR and natural imagery.

What would settle it

Retraining the same architecture on SARVLM-1M without the optical remote sensing bridge stage and measuring whether the performance advantage over existing vision-language models on the thirteen benchmarks disappears or reverses.
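The proposed check can be laid out as a small comparison harness. This is a sketch under stated assumptions: `build_pipeline`, `advantage_by_benchmark`, and `advantage_disappears` are hypothetical helpers, the stage descriptions paraphrase this page, and the scores passed in the usage lines are placeholders rather than results from the paper.

```python
def build_pipeline(use_optical_bridge):
    """List the training stages for the full run or the no-bridge ablation."""
    stages = ["init: natural-image pretrained weights"]
    if use_optical_bridge:
        stages.append("stage 1: fine-tune on optical remote sensing image-text pairs")
    stages.append("final stage: fine-tune on SARVLM-1M")
    return stages


def advantage_by_benchmark(model_scores, baseline_scores):
    """Score margin of the model over a baseline on each shared benchmark."""
    return {name: model_scores[name] - baseline_scores[name]
            for name in baseline_scores}


def advantage_disappears(margins):
    """True when the model no longer beats the baseline on any benchmark."""
    return all(margin <= 0.0 for margin in margins.values())


full_run = build_pipeline(use_optical_bridge=True)
ablation = build_pipeline(use_optical_bridge=False)

# Placeholder scores only, to show the call pattern.
margins = advantage_by_benchmark({"retrieval": 0.5}, {"retrieval": 0.25})
```

Feeding the no-bridge model's scores into `advantage_by_benchmark` against each existing vision-language baseline would give the margins the settling experiment asks about.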

Figures

Figures reproduced from arXiv: 2510.22665 by Puhong Duan, Qiwei Ma, Shutao Li, Wang Liu, Xudong Kang, Xukun Lu.

Figure 1: Comparison of image-text datasets in remote sensing.
Figure 2: The paradigm for foundation model: (a) CL-based methods, (b) MIM-based methods, (c) CLIP-based methods.
Figure 3: (a) Examples from the SARVLM-1M dataset; (b) Two-stage domain transfer training strategy for SARCLIP.
Figure 4: The framework of the SARCap method.
Excerpt from §3.1, Problem Definition: In this section, we investigate the paradigm of learning joint representations from SAR images and their corresponding textual descriptions. Specifically, we construct the SARVLM-1M dataset $D = \{(I_i, T_i)\}_{i=1}^{M}$, consisting of SAR images $I_i \in \mathbb{R}^{H \times W}$ with the corresponding descriptions $T_i \in \mathcal{T}$.
Figure 6: Feature space visualization of the SARCLIP‡ image encoder on three downstream datasets (ViT-L-14).
Figure 7: Visualization of results on the semantic localization task.
Figure 9: Ablation study on training layers of SARCLIP.
Original abstract

Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multi-modal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to SAR domains. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. In addition, an ensemble strategy is utilized to improve the cross-scene generalization capability of the model. Moreover, SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released on https://github.com/KlayMa527/SARVLM.git.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces SARVLM, the first vision-language foundation model for SAR imagery. It constructs the SARVLM-1M dataset comprising over one million image-text pairs and proposes a two-stage domain transfer strategy that uses optical remote sensing data as an intermediate bridge to transfer knowledge from natural images to SAR. The model consists of SARCLIP and SARCap components, incorporates an ensemble strategy for cross-scene generalization, and includes SARDet and SARRot for object detection validation. Extensive experiments claim consistent outperformance over state-of-the-art vision-language models across 13 benchmarks spanning image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning.

Significance. If the performance claims are supported by rigorous ablations, statistical tests, and clear baseline comparisons, this work would represent a meaningful advance in multimodal semantic understanding for SAR imagery, addressing the scarcity of SAR vision-language data and the domain gap with natural images. The construction and planned release of the SARVLM-1M dataset would provide a valuable resource for the community. The two-stage transfer approach, if validated, could offer a practical template for domain adaptation in remote sensing modalities.

major comments (2)
  1. [Abstract and §3, Two-stage domain transfer] The central claim that the two-stage strategy (natural images → optical remote sensing → SAR) mitigates substantial SAR-natural differences and drives the reported gains is load-bearing for explaining outperformance over SOTA VLMs, yet no ablation studies isolate its contribution versus single-stage direct adaptation, dataset scale alone, or architecture choices in SARCLIP/SARCap. This leaves the effectiveness of the optical RS bridge as an untested assumption.
  2. [§4, Experiments and associated tables] The assertion of consistent outperformance on 13 benchmarks lacks reported details on exact baselines, data splits, statistical significance testing, or variance across runs. Without these, it is impossible to assess whether gains are robust or influenced by post-hoc choices, undermining the cross-task superiority claim.
minor comments (2)
  1. [§3] The notation for SARCLIP and SARCap components could be clarified with explicit architectural diagrams or equations showing how they differ from standard CLIP and captioning heads.
  2. [§4] Ensure all benchmark results include the number of runs or error bars to support reproducibility claims.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed review. We address each major comment below and outline the revisions we will make to improve the rigor and transparency of the manuscript.

  1. Referee: [Abstract and §3, Two-stage domain transfer] The central claim that the two-stage strategy (natural images → optical remote sensing → SAR) mitigates substantial SAR-natural differences and drives the reported gains is load-bearing for explaining outperformance over SOTA VLMs, yet no ablation studies isolate its contribution versus single-stage direct adaptation, dataset scale alone, or architecture choices in SARCLIP/SARCap. This leaves the effectiveness of the optical RS bridge as an untested assumption.

    Authors: We agree that dedicated ablations are required to substantiate the contribution of the two-stage domain transfer. The current manuscript motivates the optical remote sensing bridge based on the domain gap but does not isolate its effect. In the revised manuscript we will add ablation experiments in §3 and §4 that compare (i) the full two-stage pipeline against direct single-stage adaptation from natural images to SAR, (ii) performance at different dataset scales, and (iii) variations in SARCLIP/SARCap architecture while keeping the transfer strategy fixed. These results will be presented with quantitative deltas to clarify the incremental benefit of the intermediate optical RS stage.
    Revision: yes

  2. Referee: [§4, Experiments and associated tables] The assertion of consistent outperformance on 13 benchmarks lacks reported details on exact baselines, data splits, statistical significance testing, or variance across runs. Without these, it is impossible to assess whether gains are robust or influenced by post-hoc choices, undermining the cross-task superiority claim.

    Authors: We acknowledge that the experimental section requires additional detail for reproducibility and statistical rigor. The revised §4 and tables will specify: (a) exact baseline models with citations and implementation details, (b) the precise train/validation/test splits used for each of the 13 benchmarks, (c) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) comparing SARVLM against baselines, and (d) mean and standard deviation of key metrics across at least three random seeds. These additions will allow readers to evaluate the robustness of the reported improvements.
    Revision: yes
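The reporting promised in the second response (mean and standard deviation over seeds, plus a paired test) can be sketched with the standard library alone. The scores below are placeholders chosen for illustration, not values from the paper, and the helper names are hypothetical. The sketch stops at the t statistic; turning it into a p-value needs a t distribution with n − 1 degrees of freedom, which a library routine such as `scipy.stats.ttest_rel` provides.

```python
import math
import statistics


def summarize(scores):
    """Mean and sample standard deviation across random seeds."""
    return statistics.mean(scores), statistics.stdev(scores)


def paired_t_statistic(scores_a, scores_b):
    """t statistic for paired samples, e.g. the same seeds on two models."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    spread = statistics.stdev(diffs)
    if spread == 0.0:
        raise ValueError("differences have zero variance; t is undefined")
    return statistics.mean(diffs) / (spread / math.sqrt(len(diffs)))


# Placeholder per-seed scores for one benchmark (three seeds each).
model_scores = [0.80, 0.82, 0.81]
baseline_scores = [0.78, 0.79, 0.80]

model_mean, model_std = summarize(model_scores)
t_stat = paired_t_statistic(model_scores, baseline_scores)
```

With only three seeds the test has two degrees of freedom, so even a sizeable t statistic may not reach conventional significance; that is a reason to report the per-seed spread alongside any p-value.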

Circularity Check

0 steps flagged

No circularity: empirical benchmarks validate proposed two-stage transfer


The paper constructs the SARVLM-1M dataset and proposes a two-stage training strategy that uses optical remote sensing data as an intermediate bridge for knowledge transfer from natural images to SAR. It then trains SARCLIP and SARCap components and reports outperformance on 13 empirical benchmarks for retrieval, classification, detection, and captioning. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs or to self-citations. The central claims rest on experimental results rather than self-definitional loops, renamed known results, or load-bearing self-citations. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 3 invented entities

The central claim rests on the effectiveness of the two-stage domain transfer and the representativeness of the aggregated SARVLM-1M dataset; many training hyperparameters and exact aggregation rules are not specified in the abstract.

free parameters (1)
  • Two-stage domain transfer hyperparameters
    Learning rates, epochs, and loss weights for the intermediate optical bridge stage are not detailed but must be chosen or fitted to achieve the reported transfer.
axioms (1)
  • Domain assumption: Optical remote sensing imagery shares sufficient structural properties with both natural images and SAR to act as an effective knowledge-transfer bridge.
    Invoked directly in the description of the two-stage training strategy to mitigate differences between SAR and natural imagery.
invented entities (3)
  • SARVLM-1M (no independent evidence)
    purpose: Large-scale vision-language training dataset for SAR
    Aggregated from existing datasets; no independent validation of quality or coverage is provided in the abstract.
  • SARCLIP (no independent evidence)
    purpose: SAR-adapted image-text alignment model
    Core component of the proposed SARVLM framework.
  • SARCap (no independent evidence)
    purpose: SAR image captioning module
    Core component of the proposed SARVLM framework.




Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model

    eess.IV · 2026-05 · unverdicted · novelty 6.0

    Introduces the SMART-HC-VQA dataset with 65k single-image and 2.3M temporal VQA examples plus an adapted LLaVA-NeXT MLLM framework for geospatial-temporal sensemaking of remote sensing construction activity.
