SARVLM: A Vision Language Foundation Model for Semantic Understanding in SAR Imagery
Pith reviewed 2026-05-21 20:54 UTC · model grok-4.3
The pith
SARVLM is the first vision-language foundation model for SAR imagery, built with a million-scale dataset and optical remote sensing data as a bridge to transfer knowledge from natural images.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SARVLM, consisting of SARCLIP and SARCap, is developed on the SARVLM-1M dataset of more than one million image-text pairs through a two-stage domain transfer training strategy that uses optical remote sensing data as an intermediate bridge to move knowledge from natural images into the SAR domain; an ensemble strategy further improves cross-scene generalization. The resulting model, together with SARDet and SARRot extensions, produces stronger feature extraction and interpretation than prior vision-language models across thirteen benchmarks covering image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning.
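The paper passage quoted in the Lean theorems section of this page describes SARCLIP's contrastive loss as InfoNCE-based. A minimal sketch of a symmetric InfoNCE objective over paired image and text embeddings follows. It assumes cosine similarity and a fixed temperature of 0.07, neither of which is stated on this page, and all function names are illustrative rather than taken from the SARVLM code.

```python
import math

def cosine_similarity_matrix(image_embs, text_embs):
    """Return sim[i][j], the cosine similarity of image i and text j."""
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]
    imgs = [normalize(v) for v in image_embs]
    txts = [normalize(v) for v in text_embs]
    return [[sum(a * b for a, b in zip(i, t)) for t in txts] for i in imgs]

def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum_exp = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)

def symmetric_info_nce(image_embs, text_embs, temperature=0.07):
    """Average of image-to-text and text-to-image InfoNCE losses.

    temperature=0.07 is an assumed value, not one reported on this page.
    """
    sim = cosine_similarity_matrix(image_embs, text_embs)
    logits = [[s / temperature for s in row] for row in sim]
    logits_t = [list(col) for col in zip(*logits)]  # transpose for text-to-image
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))
```

The loss is small when each image is most similar to its own caption and grows when the pairing is scrambled, which is the property a contrastive image-text model is trained to exploit.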
What carries the argument
Two-stage domain transfer training strategy that treats optical remote sensing imagery as an intermediate bridge to adapt natural-image knowledge to SAR.
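The control flow of such a staged schedule can be sketched on a toy problem. The example below is not the authors' training code: it fits a single scalar weight by gradient descent, and the names `fine_tune` and `two_stage_transfer`, the learning rate, and the epoch count are all assumptions made for illustration. The only point it shows is that the second stage starts from the checkpoint produced by the first.

```python
def fine_tune(weight, pairs, lr=0.05, epochs=50):
    """Gradient descent on mean squared error for the toy model y = weight * x."""
    for _ in range(epochs):
        grad = sum(2 * (weight * x - y) * x for x, y in pairs) / len(pairs)
        weight -= lr * grad
    return weight

def two_stage_transfer(pretrained_weight, optical_pairs, sar_pairs):
    """Adapt a pretrained weight to optical data first, then to SAR data."""
    # Stage 1: start from the natural-image initialization, adapt to optical RS.
    bridge_weight = fine_tune(pretrained_weight, optical_pairs)
    # Stage 2: start from the bridge checkpoint, adapt to SAR.
    sar_weight = fine_tune(bridge_weight, sar_pairs)
    return bridge_weight, sar_weight

# Hypothetical toy data: the optical targets follow slope 2, the SAR targets slope 3.
optical_pairs = [(1.0, 2.0), (2.0, 4.0)]
sar_pairs = [(1.0, 3.0), (2.0, 6.0)]
bridge_weight, sar_weight = two_stage_transfer(1.0, optical_pairs, sar_pairs)
```

A single-stage ablation in the same toy setting would call `fine_tune(1.0, sar_pairs)` directly, skipping the optical stage.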
If this is right
- SARVLM improves accuracy on image-text retrieval, zero-shot classification, and image captioning for SAR scenes.
- The same framework yields stronger object detection results when instantiated as SARDet and SARRot.
- The ensemble component increases robustness across different imaging scenes and conditions.
- Semantic localization and target recognition tasks become more reliable without task-specific fine-tuning.
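Two of the tasks listed above, zero-shot classification and image-text retrieval, are commonly evaluated by ranking text embeddings against an image embedding. The sketch below assumes cosine similarity in a shared embedding space and a one-to-one pairing between images and captions; the function names and the embeddings themselves are hypothetical, and the page does not specify the paper's exact evaluation protocol.

```python
import math

def _normalize(v):
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def cosine(a, b):
    """Cosine similarity between two vectors."""
    a, b = _normalize(a), _normalize(b)
    return sum(x * y for x, y in zip(a, b))

def zero_shot_classify(image_emb, class_text_embs):
    """Return the class whose prompt embedding is most similar to the image."""
    return max(class_text_embs, key=lambda name: cosine(image_emb, class_text_embs[name]))

def recall_at_k(image_embs, text_embs, k):
    """Image-to-text Recall@K, assuming text i is the only match for image i."""
    hits = 0
    for i, img in enumerate(image_embs):
        ranked = sorted(
            range(len(text_embs)),
            key=lambda j: cosine(img, text_embs[j]),
            reverse=True,
        )
        hits += i in ranked[:k]  # True counts as 1
    return hits / len(image_embs)
```

In a real pipeline the class prompts (for example, one sentence per target category) would be encoded by the text tower and the images by the vision tower; here both are passed in as plain lists of floats.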
Where Pith is reading between the lines
- The optical-bridge technique could be tested on other radar-like modalities such as sonar or ground-penetrating radar where paired text data is also scarce.
- Operational pipelines that fuse live SAR streams with text queries might become feasible once the model is quantized for edge hardware.
- The dataset-construction method offers a template for building multimodal corpora in other remote-sensing domains that lack direct text labels.
Load-bearing premise
Optical remote sensing data can serve as an effective intermediate bridge to transfer knowledge from natural images to the SAR domain despite the substantial differences between SAR and natural imagery.
What would settle it
Retraining the same architecture on SARVLM-1M without the optical remote sensing bridge stage and measuring whether the performance advantage over existing vision-language models on the thirteen benchmarks disappears or reverses.
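The decision rule in that test can be written down explicitly. The sketch below assumes higher scores are better and that each benchmark is summarized by a single number; the function name, the verdict labels, and the tolerance are illustrative choices, and no scores from the paper are used.

```python
def bridge_ablation_verdict(with_bridge, without_bridge, baseline, tol=1e-9):
    """Classify, per benchmark, what happens to the margin over a baseline
    when the optical remote sensing stage is removed.

    All three arguments map benchmark name -> score (higher is better).
    """
    verdicts = {}
    for name in with_bridge:
        full_margin = with_bridge[name] - baseline[name]
        ablated_margin = without_bridge[name] - baseline[name]
        if abs(ablated_margin) <= tol:
            verdicts[name] = "disappears"  # no advantage left without the bridge
        elif ablated_margin < 0:
            verdicts[name] = "reverses"    # baseline now wins
        elif ablated_margin < full_margin - tol:
            verdicts[name] = "shrinks"     # bridge contributed part of the gain
        else:
            verdicts[name] = "holds"       # bridge did not help on this benchmark
    return verdicts
```

If most of the thirteen benchmarks came back as "holds", the gains would be better explained by dataset scale or architecture than by the bridge stage.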
Original abstract
Synthetic Aperture Radar (SAR) is a critical imaging modality due to its all-weather operational capability. Although recent advances in self-supervised learning and masked image modeling (MIM) have enabled SAR foundation models, these approaches primarily focus on low-level visual features and often neglect multi-modal representation. Moreover, multimodal data for SAR is scarce, limiting the development of robust cross-modal models. To address this limitation, we construct SARVLM-1M, a large-scale vision-language dataset comprising over one million image-text pairs aggregated from existing datasets. Furthermore, to mitigate the substantial differences between SAR and natural imagery, we propose a two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge, facilitating effective knowledge transfer from natural images to SAR domains. Based on this strategy, we develop SARVLM, the first vision-language foundation model tailored for SAR, consisting of SARCLIP and SARCap. In addition, an ensemble strategy is utilized to improve the cross-scene generalization capability of the model. Moreover, SARDet and SARRot further validate the capability of the proposed framework in object detection. Extensive experiments on 13 benchmarks across image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning demonstrate the superior feature extraction and interpretation capabilities of SARVLM. It consistently outperforms state-of-the-art vision-language models and advances semantic understanding in SAR imagery. Code and datasets will be released on https://github.com/KlayMa527/SARVLM.git.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces SARVLM, the first vision-language foundation model for SAR imagery. It constructs the SARVLM-1M dataset comprising over one million image-text pairs and proposes a two-stage domain transfer strategy that uses optical remote sensing data as an intermediate bridge to transfer knowledge from natural images to SAR. The model consists of SARCLIP and SARCap components, incorporates an ensemble strategy for cross-scene generalization, and includes SARDet and SARRot for object detection validation. Extensive experiments claim consistent outperformance over state-of-the-art vision-language models across 13 benchmarks spanning image-text retrieval, target recognition, zero-shot classification, object detection, semantic localization, and image captioning.
Significance. If the performance claims are supported by rigorous ablations, statistical tests, and clear baseline comparisons, this work would represent a meaningful advance in multimodal semantic understanding for SAR imagery, addressing the scarcity of SAR vision-language data and the domain gap with natural images. The construction and planned release of the SARVLM-1M dataset would provide a valuable resource for the community. The two-stage transfer approach, if validated, could offer a practical template for domain adaptation in remote sensing modalities.
major comments (2)
- [Abstract and §3, Two-stage domain transfer] The central claim that the two-stage strategy (natural images → optical remote sensing → SAR) mitigates the substantial differences between SAR and natural imagery and drives the reported gains is load-bearing for explaining the outperformance over SOTA VLMs. Yet no ablation studies isolate its contribution from single-stage direct adaptation, dataset scale alone, or architecture choices in SARCLIP/SARCap. This leaves the effectiveness of the optical RS bridge as an untested assumption.
- [§4, Experiments and associated tables] The assertion of consistent outperformance on 13 benchmarks lacks reported details on exact baselines, data splits, statistical significance testing, or variance across runs. Without these, it is impossible to assess whether the gains are robust or influenced by post-hoc choices, which undermines the cross-task superiority claim.
minor comments (2)
- [§3] The notation for SARCLIP and SARCap components could be clarified with explicit architectural diagrams or equations showing how they differ from standard CLIP and captioning heads.
- [§4] Ensure all benchmark results include the number of runs or error bars to support reproducibility claims.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed review. We address each major comment below and outline the revisions we will make to improve the rigor and transparency of the manuscript.
- Referee: [Abstract and §3, Two-stage domain transfer] The central claim that the two-stage strategy (natural images → optical remote sensing → SAR) mitigates substantial SAR-natural differences and drives the reported gains is load-bearing for explaining outperformance over SOTA VLMs, yet no ablation studies isolate its contribution versus single-stage direct adaptation, dataset scale alone, or architecture choices in SARCLIP/SARCap. This leaves the effectiveness of the optical RS bridge as an untested assumption.
  Authors: We agree that dedicated ablations are required to substantiate the contribution of the two-stage domain transfer. The current manuscript motivates the optical remote sensing bridge based on the domain gap but does not isolate its effect. In the revised manuscript we will add ablation experiments in §3 and §4 that compare (i) the full two-stage pipeline against direct single-stage adaptation from natural images to SAR, (ii) performance at different dataset scales, and (iii) variations in SARCLIP/SARCap architecture while keeping the transfer strategy fixed. These results will be presented with quantitative deltas to clarify the incremental benefit of the intermediate optical RS stage.
  Revision: yes
- Referee: [§4, Experiments and associated tables] The assertion of consistent outperformance on 13 benchmarks lacks reported details on exact baselines, data splits, statistical significance testing, or variance across runs. Without these, it is impossible to assess whether gains are robust or influenced by post-hoc choices, undermining the cross-task superiority claim.
  Authors: We acknowledge that the experimental section requires additional detail for reproducibility and statistical rigor. The revised §4 and tables will specify: (a) exact baseline models with citations and implementation details, (b) the precise train/validation/test splits used for each of the 13 benchmarks, (c) statistical significance tests (e.g., paired t-tests or Wilcoxon signed-rank tests with p-values) comparing SARVLM against baselines, and (d) mean and standard deviation of key metrics across at least three random seeds. These additions will allow readers to evaluate the robustness of the reported improvements.
  Revision: yes
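The per-seed reporting and paired test promised in the rebuttal can be sketched with the Python standard library alone. The snippet below computes the mean and sample standard deviation across seeds and the paired t-statistic with its degrees of freedom; it assumes the model and baseline were evaluated on the same seeds in the same order, and the function names are illustrative. Turning the statistic into a p-value needs the t-distribution CDF, which a routine such as `scipy.stats.ttest_rel` provides.

```python
import math
import statistics

def summarize(scores):
    """Return (mean, sample standard deviation) of one metric across seeds."""
    return statistics.mean(scores), statistics.stdev(scores)

def paired_t_statistic(model_scores, baseline_scores):
    """Return (t, degrees_of_freedom) for paired per-seed scores.

    Assumes model_scores[i] and baseline_scores[i] come from the same seed.
    """
    if len(model_scores) != len(baseline_scores):
        raise ValueError("paired test needs equally many scores per system")
    diffs = [m - b for m, b in zip(model_scores, baseline_scores)]
    n = len(diffs)
    sd_diff = statistics.stdev(diffs)  # requires n >= 2
    if sd_diff == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = statistics.mean(diffs) / (sd_diff / math.sqrt(n))
    return t, n - 1
```

With only three seeds the test has two degrees of freedom, so even a large t-statistic corresponds to a fairly wide confidence interval; reporting the per-seed scores alongside the summary makes that visible.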
Circularity Check
No circularity: empirical benchmarks validate proposed two-stage transfer
The paper constructs the SARVLM-1M dataset and proposes a two-stage training strategy that uses optical remote sensing data as an intermediate bridge for knowledge transfer from natural images to SAR. It then trains SARCLIP and SARCap components and reports outperformance on 13 empirical benchmarks for retrieval, classification, detection, and captioning. No equations, fitted parameters, or predictions are presented that reduce by construction to the inputs or to self-citations. The central claims rest on experimental results rather than self-definitional loops, renamed known results, or load-bearing self-citations. The derivation chain is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (1)
- Two-stage domain transfer hyperparameters
axioms (1)
- Domain assumption: Optical remote sensing imagery shares sufficient structural properties with both natural images and SAR to act as an effective knowledge-transfer bridge.
invented entities (3)
- SARVLM-1M: no independent evidence
- SARCLIP: no independent evidence
- SARCap: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, theorem washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: two-stage domain transfer training strategy that leverages optical remote sensing data as an intermediate bridge... SARCLIP... contrastive loss... InfoNCE-based
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, theorem reality_from_one_distinction (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Passage: SARVLM-1M... 1.7 million image-text pairs... templated text synthesis
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 1 Pith paper
- Geospatial-Temporal Sensemaking of Remote Sensing Activity Detections with Multimodal Large Language Model
  Introduces the SMART-HC-VQA dataset with 65k single-image and 2.3M temporal VQA examples plus an adapted LLaVA-NeXT MLLM framework for geospatial-temporal sensemaking of remote sensing construction activity.