pith. sign in

arxiv: 2604.20429 · v1 · submitted 2026-04-22 · 💻 cs.CV

Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing

Pith reviewed 2026-05-09 23:59 UTC · model grok-4.3

classification 💻 cs.CV
keywords remote sensingcross-modal retrievalimage-text retrievaltwo-stage frameworkmulti-granular representationefficient retrieval
0
0 comments X

The pith

A two-stage fast-then-fine framework retrieves remote sensing images from text with competitive accuracy and much higher speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to handle the tension between fine-grained cross-modal alignment and fast retrieval when querying massive remote sensing imagery that contains many objects and cluttered backgrounds. It does so by splitting the task into an initial text-agnostic recall stage that uses only coarse-grained representations to filter a small candidate set, followed by a rerank stage that applies text-guided interactions for precise matching. The rerank stage avoids extra learnable parameters by using a balanced interaction block, while an inter- and intra-modal loss trains the system across both coarse and fine representations. A reader should care because current alternatives either slow down every query with heavy interactions or demand large-scale pre-training; this split aims to deliver both practicality and accuracy for real-world remote sensing archives.

Core claim

The fast-then-fine framework decomposes cross-modal retrieval into a text-agnostic recall stage that employs coarse-grained representations for rapid candidate selection and a subsequent text-guided rerank stage that applies a parameter-free balanced interaction block to achieve fine-grained alignment, with both stages trained jointly by an inter- and intra-modal loss operating on multi-granular representations.

What carries the argument

The fast-then-fine two-stage architecture, in which the recall stage uses text-agnostic coarse representations to select candidates efficiently and the rerank stage employs a parameter-free text-guided interaction block to refine alignment.

If this is right

  • Complex cross-modal interaction can be limited to a small candidate set rather than applied to every image in the database.
  • Fine-grained alignment can be added after coarse filtering without introducing new trainable parameters in the interaction block.
  • Joint optimization across coarse and fine representations improves alignment quality for imagery with dense objects and complex backgrounds.
  • Overall query time drops substantially while retrieval accuracy stays competitive with single-stage methods on public remote sensing benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same staged separation of coarse filtering from fine alignment could be tested on other vision-language retrieval tasks that face dense or cluttered scenes.
  • Because the rerank block adds no learnable parameters, the approach may integrate readily with existing pre-trained encoders without full retraining.
  • Scaling the recall stage to larger archives would test whether the efficiency advantage grows with database size.

Load-bearing premise

The text-agnostic recall stage can reliably narrow the search to a small candidate set that still contains the correct fine-grained matches without discarding relevant items.

What would settle it

On standard benchmarks such as RSICD or RSITMD, count how many ground-truth matches are missing from the candidate set returned by the recall stage alone; a high miss rate would show the assumption fails.

Figures

Figures reproduced from arXiv: 2604.20429 by Shuquan Wei, Wei Wang, Xiangyang Jia, Xi Chen, Xu Chen, Xu Zhang.

Figure 1
Figure 1. Figure 1: Comparison of three paradigms for remote sensing [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Accuracy–efficiency comparison of task-specific RS [PITH_FULL_IMAGE:figures/full_fig_p002_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: The overall FTF two-stage retrieval framework. Multi-granular visual embeddings [PITH_FULL_IMAGE:figures/full_fig_p004_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Efficiency–accuracy balance of FTF under different [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗
Figure 5
Figure 5. Figure 5: Qualitative comparisons with representative base [PITH_FULL_IMAGE:figures/full_fig_p007_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Qualitative examples illustrating the effectiveness of the proposed two-stage retrieval framework. In the recall stage, [PITH_FULL_IMAGE:figures/full_fig_p008_6.png] view at source ↗
Figure 7
Figure 7. Figure 7: Qualitative examples illustrating the effectiveness [PITH_FULL_IMAGE:figures/full_fig_p008_7.png] view at source ↗
read the original abstract

Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that the FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes a Fast-then-Fine (FTF) two-stage framework for cross-modal image-text retrieval in remote sensing. It decomposes retrieval into a text-agnostic recall stage using coarse-grained representations for efficient candidate selection, followed by a text-guided rerank stage employing a parameter-free balanced interaction block for fine-grained alignment. An inter- and intra-modal loss jointly optimizes multi-granular representations. The authors assert that extensive experiments on public benchmarks demonstrate competitive retrieval accuracy together with significant efficiency gains over prior methods.

Significance. If the experimental claims are substantiated, the work would be significant for remote sensing applications, where dense multi-object scenes demand both fine-grained cross-modal alignment and scalable retrieval. The separation into recall and rerank stages, combined with the explicitly parameter-free reranking block, offers a practical route to efficiency without large-scale pre-training or heavy interaction modules. The multi-granular loss provides a clean way to supervise representations at different levels of granularity.

major comments (2)
  1. [Experiments] The headline claim of competitive accuracy plus large efficiency gains rests on the text-agnostic recall stage reliably retrieving a small candidate pool that still contains essentially all correct fine-grained matches. The manuscript provides no explicit recall@K numbers for the first stage, no analysis of missed ground-truth pairs, and no ablation relating candidate-set size to final mean recall (mR). Without these data the central efficiency-accuracy tradeoff cannot be evaluated (Experiments section and associated tables).
  2. [Method] The rerank stage is described as using a 'parameter-free balanced text-guided interaction block.' It is unclear from the method description how the balancing mechanism is realized without introducing any learnable parameters or implicit fitting; the claim requires an explicit accounting of all operations and parameters in that block (Method section, rerank-stage subsection).
minor comments (2)
  1. [Abstract] The abstract asserts 'extensive experiments' yet supplies no quantitative highlights, error bars, or baseline comparisons; adding one or two key numbers would improve readability.
  2. [Method] Notation for the coarse- and fine-grained representations and the inter-/intra-modal loss terms would benefit from a compact table or diagram summarizing the multi-granular components.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and for recognizing the potential significance of the FTF framework. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.

read point-by-point responses
  1. Referee: [Experiments] The headline claim of competitive accuracy plus large efficiency gains rests on the text-agnostic recall stage reliably retrieving a small candidate pool that still contains essentially all correct fine-grained matches. The manuscript provides no explicit recall@K numbers for the first stage, no analysis of missed ground-truth pairs, and no ablation relating candidate-set size to final mean recall (mR). Without these data the central efficiency-accuracy tradeoff cannot be evaluated (Experiments section and associated tables).

    Authors: We agree that explicit quantification of the recall stage performance is necessary to fully substantiate the efficiency-accuracy tradeoff. In the revised manuscript we will add, in the Experiments section, recall@K results for the text-agnostic recall stage at multiple candidate-pool sizes, a breakdown of any ground-truth pairs missed by the first stage, and an ablation table showing the effect of candidate-set size on final mR. These additions will be supported by the existing experimental setup and will allow direct evaluation of the claimed tradeoff. revision: yes

  2. Referee: [Method] The rerank stage is described as using a 'parameter-free balanced text-guided interaction block.' It is unclear from the method description how the balancing mechanism is realized without introducing any learnable parameters or implicit fitting; the claim requires an explicit accounting of all operations and parameters in that block (Method section, rerank-stage subsection).

    Authors: We acknowledge that the current description of the parameter-free balanced text-guided interaction block lacks sufficient detail. In the revised manuscript we will expand the rerank-stage subsection to provide a complete, step-by-step accounting of every operation. The balancing mechanism is implemented via fixed, non-learnable operations consisting of text-guided cross-attention followed by element-wise averaging with the original features and a static normalization factor; no trainable weights, implicit fitting, or additional parameters are introduced beyond the base encoders. The revised text will include the exact mathematical formulations and a parameter-count verification to confirm the block remains parameter-free. revision: yes

Circularity Check

0 steps flagged

No circularity: independent engineering proposal with no self-referential derivations

full rationale

The paper proposes an FTF two-stage framework that decomposes retrieval into a text-agnostic recall stage using coarse-grained representations and a text-guided rerank stage with a parameter-free balanced interaction block, plus an inter- and intra-modal loss for multi-granular alignment. No equations, derivations, or self-citations appear in the abstract or described method that reduce the claimed efficiency-accuracy tradeoff to fitted parameters renamed as predictions, self-definitional constructs, or load-bearing prior work by the same authors. Performance is asserted via experiments on public benchmarks rather than by construction from inputs. This is a standard self-contained engineering contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides no explicit free parameters, axioms, or invented entities; the framework is described at the level of architectural choices and loss design only.

pith-pipeline@v0.9.0 · 5515 in / 1002 out tokens · 25108 ms · 2026-05-09T23:59:08.782353+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

62 extracted references · 62 canonical work pages · 1 internal anchor

  1. [1]

    Mothilal Asokan, Kebin Wu, and Fatima Albreiki. 2025. FineLIP: Extending CLIP’s Reach via Fine-Grained Alignment with Longer Text Inputs. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 14495–14504

  2. [2]

    Guillaume Astruc, Nicolas Gonthier, Clément Mallet, and Loic Landrieu. 2025. AnySat: One Earth Observation Model for Many Resolutions, Scales, and Modali- ties. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). 19530–19540

  3. [3]

    Qianyue Bao, Fang Liu, Licheng Jiao, Yang Liu, Shuo Li, Lingling Li, Xu Liu, et al. 2025. Visual-Language Scene-Relation-Aware Zero-Shot Captioner.IEEE Transactions on Pattern Analysis and Machine Intelligence47, 10 (2025), 8725–8739

  4. [4]

    Xiumei Chen, Xiangtao Zheng, and Xiaoqiang Lu. 2025. Context-aware local- global semantic alignment for remote sensing image-text retrieval.IEEE Transac- tions on Geoscience and Remote Sensing(2025)

  5. [5]

    Hang Cheng, Hehui Ye, Xiaofei Zhou, Ximeng Liu, Fei Chen, and Meiqing Wang

  6. [6]

    Vision-language pre-training via modal interaction.Pattern Recognition 156 (2024), 110809

  7. [7]

    Hyojin Choi et al. 2025. GOAL: Global-local Object Alignment Learning. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  8. [8]

    Zhen Dai et al. 2025. Text–video retrieval re-ranking via multi-grained cross attention and token selectors.Pattern Recognition160 (2025), 111198

  9. [9]

    Zuozhuo Dai, Kaihui Cheng, Fangtao Shao, Zilong Dong, and Siyu Zhu. 2025. Text–video retrieval re-ranking via multi-grained cross attention and frozen image encoders.Pattern Recognition159 (2025), 111099. doi:10.1016/j.patcog.2024. 111099

  10. [10]

    Peilin Feng, Zhutao Lv, Junyan Ye, Xiaolei Wang, Xinjie Huo, Jinhua Yu, Wanghan Xu, Wenlong Zhang, Lei Bai, Conghui He, and Weijia Li. 2026. Earth-Agent: Unlocking the Full Landscape of Earth Observation with Agents. InInternational Conference on Learning Representations (ICLR)

  11. [11]

    Yangpeng He, Xin Xu, Hongjia Chen, Jinwen Li, and Fangling Pu. 2024. Visual global-salient-guided network for remote sensing image-text retrieval.IEEE Transactions on Geoscience and Remote Sensing62 (2024), 1–14

  12. [12]

    Huang et al

    Z. Huang et al. 2025. Noise-Robust Vision-Language Pre-Training With Positive- Negative Learning.IEEE Transactions on Pattern Analysis and Machine Intelligence (2025)

  13. [13]

    Zhong Ji, Changxu Meng, Yan Zhang, Yanwei Pang, and Xuelong Li. 2023. Knowledge-aided momentum contrastive learning for remote-sensing image text retrieval.IEEE Transactions on Geoscience and Remote Sensing61 (2023), 1–13

  14. [14]

    Zhong Ji, Changxu Meng, Yan Zhang, Haoran Wang, Yanwei Pang, and Jun- gong Han. 2024. Eliminate before align: A remote sensing image-text retrieval framework with keyword explicit reasoning. InProceedings of the 32nd ACM international conference on multimedia. 1662–1671

  15. [15]

    Haojun Jiang, Jianke Zhang, Rui Huang, Chunjiang Ge, Zanlin Ni, Shiji Song, and Gao Huang. 2025. Cross-modal adapter for vision-language retrieval.Pattern Recognition159 (2025), 111144. doi:10.1016/j.patcog.2024.111144

  16. [16]

    Kwanyoung Kim, Yujin Oh, and Jong Chul Ye. 2024. OTSeg: Multi-Prompt Sinkhorn Attention for Zero-Shot Semantic Segmentation. InComputer Vision – ECCV 2024. 200–217. doi:10.1007/978-3-031-72980-5_12

  17. [17]

    Huakai Lai, Guoxin Xiong, Huayu Mai, Xiang Liu, and Tianzhu Zhang. 2025. Rethinking Noisy Video-Text Retrieval via Relation-aware Alignment. InPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  18. [18]

    Kuang-Huei Lee, Xi Chen, Gang Hua, Houdong Hu, and Xiaodong He. 2018. Stacked cross attention for image-text matching. InProceedings of the European conference on computer vision (ECCV). 201–216

  19. [19]

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven C. H. Hoi. 2023. BLIP-2: Bootstrapping Language-Image Pre-training with Frozen Image Encoders and Large Language Models. InProceedings of the 40th International Conference on Machine Learning

  20. [20]

    Junnan Li, Dongxu Li, Caiming Xiong, and Steven C. H. Hoi. 2022. BLIP: Bootstrap- ping Language-Image Pre-training for Unified Vision-Language Understanding and Generation. InProceedings of the 39th International Conference on Machine Learning

  21. [21]

    Shuoshuo Li, Shuli Cheng, and Liejun Wang. 2025. Entity-Level Alignment with Prompt-Guided Adapter for Remote Sensing Image-Text Retrieval. InProceedings of the 33rd ACM International Conference on Multimedia. 8224–8233

  22. [22]

    Zhen Li et al. 2025. Cross-coupled semantic adversarial network for cross-modal retrieval.Artificial Intelligence Review(2025)

  23. [23]

    Zhenshi Li, Dilxat Muhtar, Feng Gu, Yanglangxing He, Xueliang Zhang, Pengfeng Xiao, Guangjun He, and Xiaoxiang Zhu. 2025. Lhrs-bot-nova: Improved multi- modal large language model for remote sensing vision-language interpretation. ISPRS Journal of Photogrammetry and Remote Sensing227 (2025), 539–550

  24. [24]

    Fan Liu, Delong Chen, Zhangqingyun Guan, Xiaocong Zhou, Jiale Zhu, Qiaolin Ye, Liyong Fu, and Jun Zhou. 2024. Remoteclip: A vision language foundation model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–16

  25. [26]

    Huaishao Luo, Lei Ji, Ming Zhong, Yang Chen, Wen Lei, Nan Duan, and Tianrui Li. 2022. CLIP4Clip: An Empirical Study of CLIP for End to End Video Clip Retrieval and Captioning.Neurocomputing508 (2022), 293–304. doi:10.1016/j. neucom.2022.07.028

  26. [27]

    Zehong Ma, Hao Chen, Wei Zeng, Limin Su, and Shiliang Zhang. 2025. Multi- Modal Reference Learning for Fine-Grained Text-to-Image Retrieval.IEEE Trans- actions on Multimedia27 (2025), 5009–5022. doi:10.1109/TMM.2025.3543066

  27. [28]

    Utkarsh Mall, Cheng Perng Phoo, Meilin Kelsey Liu, Carl Vondrick, Bharath Hariharan, and Kavita Bala. 2023. Remote sensing vision-language founda- tion models without annotations via ground remote alignment.arXiv preprint arXiv:2312.06960(2023)

  28. [29]

    Muhammad Ferjad Naeem, Yongqin Xian, Xiaohua Zhai, Lukas Hoyer, Luc Van Gool, and Federico Tombari. 2024. SILC: Improving Vision Language Pre- training with Self-Distillation. InComputer Vision – ECCV 2024. 38–56

  29. [30]

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, Mahmoud Assran, Nicolas Ballas, Wojciech Galuba, Russell Howes, Po-Yao Huang, Shang-Wen Li, Ishan Misra, Michael Rabbat, Vasu Sharma, Gabriel Synnaeve, Hu Xu, Hervé Jegou, Julien Mairal, Patrick Lab...

  30. [31]

    Pengxiang Ouyang, Qing Ma, and Cong Bai. 2025. Sparse Information Percep- tion Network for Remote Sensing Cross-Modal Retrieval.IEEE Transactions on Geoscience and Remote Sensing(2025)

  31. [32]

    Jiancheng Pan, Qing Ma, and Cong Bai. 2023. A prior instruction representation framework for remote sensing image-text retrieval. InProceedings of the 31st ACM International Conference on Multimedia. 611–620

  32. [33]

    Xingyu Qin et al. 2025. CLIP is Almost All You Need: Towards Parameter-Efficient Scene Text Retrieval without OCR. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  33. [34]

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. 2021. Learning transferable visual models from natural language supervision. InInternational conference on machine learning. PmLR, 8748–8763

  34. [35]

    Furao Shen, Xinwang Liu, Zheng Zeng, et al . 2017. Two-Stage Reranking for Remote Sensing Image Retrieval.IEEE Geoscience and Remote Sensing Letters (2017)

  35. [36]

    Sagar Soni, Akshay Dudhane, Hiyam Debary, Mustansar Fiaz, Muhammad Akhtar Munir, Muhammad Sohail Danish, Paolo Fraccaro, Campbell D Watson, Levente J Klein, Fahad Shahbaz Khan, et al. 2025. Earthdial: Turning multi-sensory earth observations to interactive dialogues. InProceedings of the Computer Vision and Pattern Recognition Conference. 14303–14313

  36. [37]

    Zengbao Sun, Ming Zhao, Gaorui Liu, and André Kaup. 2024. Cross-Modal Pre- Aligned Method with Global and Local Information for Remote-Sensing Image and Text Retrieval.arXiv preprint arXiv:2411.14704(2024)

  37. [38]

    Rong-Cheng Tu, Yatai Ji, Jie Jiang, Weijie Kong, Chengfei Cai, Wenzhe Zhao, Hongfa Wang, Yujiu Yang, and Wei Liu. 2025. Global and Local Semantic Com- pletion Learning for Vision-Language Pre-Training.IEEE Transactions on Pattern Analysis and Machine Intelligence47, 12 (2025), 11065–11079

  38. [39]

    Jingyao Wang, Zheng Liu, Shanshan Gao, Junhao Xu, and Changhao Li. 2025. From external to internal: Step-wise feature enhancement network for image-text retrieval.Neural Networks(2025). doi:10.1016/j.neunet.2025.108072

  39. [40]

    Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, and Xiao Xiang Zhu

    Yi Wang, Zhitong Xiong, Chenying Liu, Adam J. Stewart, Thomas Dujardin, Nikolaos Ioannis Bountos, Angelos Zavras, Franziska Gerken, Ioannis Papoutsis, Laura Leal-Taixé, and Xiao Xiang Zhu. 2025. Towards a Unified Copernicus Conference’17, July 2017, Washington, DC, USA X. Chen et al. Foundation Model for Earth Vision. InProceedings of the IEEE/CVF Interna...

  40. [41]

    Zihao Wang, Xihui Liu, Hongsheng Li, Lu Sheng, Junjie Yan, Xiaogang Wang, and Jing Shao. 2019. Camp: Cross-modal adaptive message passing for text-image retrieval. InProceedings of the IEEE/CVF international conference on computer vision. 5764–5773

  41. [42]

    Zhecheng Wang, Rajanie Prabha, Tianyuan Huang, Jiajun Wu, and Ram Rajagopal

  42. [43]

    InProceedings of the AAAI Conference on Artificial Intelligence, Vol

    Skyscript: A large and semantically diverse vision-language dataset for remote sensing. InProceedings of the AAAI Conference on Artificial Intelligence, Vol. 38. 5805–5813

  43. [44]

    Wei et al

    Z. Wei et al . 2025. HQ-CLIP: Leveraging Large Vision-Language Models to Create High-Quality Image-Text Datasets for Contrastive Language-Image Pre- training. InProceedings of the IEEE/CVF International Conference on Computer Vision (ICCV)

  44. [45]

    Yinghao Xiong, Xinhui Tu, and Weizhong Zhao. 2025. AFR-Rank: An effective and highly efficient LLM-based listwise reranking framework via filtering noise documents.Information Processing & Management62, 6 (2025), 104232. doi:10. 1016/j.ipm.2025.104232

  45. [46]

    Shin’ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, and Daiki Chijiwa. 2025. Post-pre-training for Modality Alignment in Vision-Language Foundation Models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

  46. [47]

    Shin’ya Yamaguchi, Dewei Feng, Sekitoshi Kanai, Kazuki Adachi, and Daiki Chijiwa. 2025. Post-pre-training for Modality Alignment in Vision-Language Foundation Models. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  47. [48]

    Lingling Yang, Tongqing Zhou, Wentao Ma, Mengze Du, Lu Liu, Feng Li, Shan Zhao, and Yuwei Wang. 2024. Remote sensing image-text retrieval with implicit- explicit relation reasoning.IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–11

  48. [49]

    Rui Yang, Shuang Wang, Yingping Han, Yuanheng Li, Dong Zhao, Dou Quan, Yanhe Guo, Licheng Jiao, and Zhi Yang. 2024. Transcending fusion: A multiscale alignment method for remote sensing image–text retrieval.IEEE Transactions on Geoscience and Remote Sensing62 (2024), 1–17

  49. [50]

    Wei Yang, Jingjing Fu, Rui Wang, Jinyu Wang, Lei Song, and Jiang Bian. 2025. OMGM: Orchestrate Multiple Granularities and Modalities for Efficient Multi- modal Retrieval. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). 24545–24563

  50. [51]

    Xiaolei Yang et al . 2022. Remote Sensing Cross-Modal Text-Image Retrieval Based on Global and Local Information.Remote Sensing14, 9 (2022), 2135

  51. [52]

    Yuan Yuan, Yang Zhan, and Zhitong Xiong. 2023. Parameter-efficient transfer learning for remote sensing image–text retrieval.IEEE Transactions on Geoscience and Remote Sensing61 (2023), 1–14

  52. [53]

    Zhiqiang Yuan, Wenkai Zhang, Kun Fu, Xuan Li, Chubo Deng, Hongqi Wang, and Xian Sun. 2022. Exploring a fine-grained multiscale method for cross-modal remote sensing image retrieval.arXiv preprint arXiv:2204.09868(2022)

  53. [54]

    Zhiqiang Yuan, Wenkai Zhang, Xuee Rong, Xuan Li, Jialiang Chen, Hongqi Wang, Kun Fu, and Xian Sun. 2021. A lightweight multi-scale crossmodal text-image retrieval method in remote sensing.IEEE Transactions on Geoscience and Remote Sensing60 (2021), 1–19

  54. [55]

    Zhiqiang Yuan, Wenkai Zhang, Changyuan Tian, Xuee Rong, Zhengyuan Zhang, Hongqi Wang, Kun Fu, and Xian Sun. 2022. Remote sensing cross-modal text- image retrieval based on global and local information.IEEE Transactions on Geoscience and Remote Sensing60 (2022), 1–16

  55. [56]

    Jinxu Zhang, Yongqi Yu, and Yu Zhang. 2024. CREAM: Coarse-to-Fine Retrieval and Multi-modal Efficient Tuning for Document VQA. InProceedings of the 32nd ACM International Conference on Multimedia

  56. [57]

    Xiong Zhang, Weipeng Li, Xu Wang, Luyao Wang, Fuzhong Zheng, Long Wang, and Haisu Zhang. 2023. A fusion encoder with multi-task guidance for cross- modal text–image retrieval in remote sensing.Remote Sensing15, 18 (2023), 4637

  57. [58]

    Zilun Zhang, Tiancheng Zhao, Yulong Guo, and Jianwei Yin. 2024. RS5M and GeoRSCLIP: A large-scale vision-language dataset and a large vision-language model for remote sensing.IEEE Transactions on Geoscience and Remote Sensing 62 (2024), 1–23

  58. [59]

    Shengwei Zhao, Linhai Xu, Yuying Liu, and Shaoyi Du. 2023. Multi-grained Representation Learning for Cross-modal Retrieval. InProceedings of the 46th International ACM SIGIR Conference on Research and Development in Information Retrieval. 2194–2198

  59. [60]

    Chengyu Zheng, Jie Nie, Bo Yin, Xiu Li, Yuntao Qian, and Zhiqiang Wei. 2025. Frequency and Spatial-domain Saliency Network for Remote Sensing Cross- Modal Retrieval.IEEE Transactions on Geoscience and Remote Sensing(2025)

  60. [61]

    Fuzhong Zheng, Xu Wang, Luyao Wang, Xiong Zhang, Hongze Zhu, Long Wang, and Haisu Zhang. 2023. A fine-grained semantic alignment method specific to aggregate multi-scale information for cross-modal remote sensing image retrieval. Sensors23, 20 (2023), 8437

  61. [62]

    Junjie Zhou, Yongping Xiong, Zheng Liu, Ze Liu, Shitao Xiao, Yueze Wang, Bo Zhao, Chen Jason Zhang, and Defu Lian. 2025. MegaPairs: Massive Data Synthesis for Universal Multimodal Retrieval. InProceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). Association for Computational Linguistics, Vienna, ...

  62. [63]

    Jun Zhu et al. 2025. Unified semantic space learning for cross-modal retrieval. Neural Networks(2025)