Fast-then-Fine: A Two-Stage Framework with Multi-Granular Representation for Cross-Modal Retrieval in Remote Sensing
Pith reviewed 2026-05-09 23:59 UTC · model grok-4.3
The pith
A two-stage fast-then-fine framework retrieves remote sensing images from text with competitive accuracy and substantially faster retrieval than existing methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The fast-then-fine framework decomposes cross-modal retrieval into two stages. A text-agnostic recall stage uses coarse-grained representations for rapid candidate selection, and a text-guided rerank stage then applies a parameter-free balanced interaction block to achieve fine-grained alignment. Both stages are trained jointly with an inter- and intra-modal loss that operates on multi-granular representations.
What carries the argument
The fast-then-fine two-stage architecture, in which the recall stage uses text-agnostic coarse representations to select candidates efficiently and the rerank stage employs a parameter-free text-guided interaction block to refine alignment.
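A minimal sketch of the recall-then-rerank flow may help make the architecture concrete. It assumes that "text-agnostic" means the coarse image embeddings are computed once, independently of any query, and it swaps in a simple max-over-patches scorer as a stand-in for the paper's interaction block, which the abstract does not specify. All function and parameter names (`recall_stage`, `rerank_stage`, `mean_max_score`, `retrieve`, `k`) are hypothetical.

```python
import numpy as np

def l2_normalize(x, axis=-1, eps=1e-12):
    """Scale vectors to unit length so dot products become cosine similarities."""
    return x / (np.linalg.norm(x, axis=axis, keepdims=True) + eps)

def recall_stage(text_global, image_global, k):
    """Stage 1 (assumed form): rank every image by cosine similarity between
    a global text embedding (D,) and precomputed coarse image embeddings (N, D),
    then keep the k best indices in descending score order."""
    scores = l2_normalize(image_global) @ l2_normalize(text_global)
    k = min(k, scores.shape[0])
    top = np.argpartition(-scores, k - 1)[:k]
    return top[np.argsort(-scores[top])]

def mean_max_score(text_tokens, patches):
    """Placeholder fine-grained scorer (not the paper's block): each word
    takes its best-matching patch, and the per-word maxima are averaged."""
    sim = l2_normalize(text_tokens) @ l2_normalize(patches).T  # (T, P)
    return float(sim.max(axis=1).mean())

def rerank_stage(text_tokens, image_patches, candidates, fine_score):
    """Stage 2: run the expensive scorer only on the recalled candidates."""
    fine = np.array([fine_score(text_tokens, image_patches[i]) for i in candidates])
    return candidates[np.argsort(-fine)]

def retrieve(text_global, text_tokens, image_global, image_patches,
             k=20, fine_score=mean_max_score):
    """Full query: coarse recall over all N images, fine rerank over k of them."""
    candidates = recall_stage(text_global, image_global, k)
    return rerank_stage(text_tokens, image_patches, candidates, fine_score)
```

The design point is that `fine_score` is called `k` times per query rather than once per database image, so any image the recall stage drops can never be recovered by the rerank stage.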
If this is right
- Complex cross-modal interaction can be limited to a small candidate set rather than applied to every image in the database.
- Fine-grained alignment can be added after coarse filtering without introducing new trainable parameters in the interaction block.
- Joint optimization across coarse and fine representations improves alignment quality for imagery with dense objects and complex backgrounds.
- Overall query time drops substantially while retrieval accuracy stays competitive with single-stage methods on public remote sensing benchmarks.
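The efficiency argument in the points above can be illustrated with a back-of-the-envelope cost model. This is a sketch under stated assumptions, not a measurement from the paper: `coarse_cost` and `fine_cost` are hypothetical per-image costs in arbitrary units, and the model ignores encoder time and indexing overhead.

```python
def query_cost(num_images, k, coarse_cost, fine_cost):
    """Return (single_stage, two_stage) per-query cost in arbitrary units.

    single_stage: the fine interaction is applied to every image.
    two_stage:    a cheap coarse score for every image, plus the fine
                  interaction for at most k recalled candidates.
    """
    single_stage = num_images * fine_cost
    two_stage = num_images * coarse_cost + min(k, num_images) * fine_cost
    return single_stage, two_stage

# Illustrative numbers only: 10,000 images, 100 candidates,
# fine interaction assumed 50x more expensive than a coarse dot product.
single, two = query_cost(10_000, 100, coarse_cost=1.0, fine_cost=50.0)
speedup = single / two
```

Under this model the two-stage cost still grows linearly with the archive size through the coarse term, so the advantage depends on how much cheaper the coarse score is than the fine interaction.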
Where Pith is reading between the lines
- The same staged separation of coarse filtering from fine alignment could be tested on other vision-language retrieval tasks that face dense or cluttered scenes.
- Because the rerank block adds no learnable parameters, the approach may integrate readily with existing pre-trained encoders without full retraining.
- Scaling the recall stage to larger archives would test whether the efficiency advantage grows with database size.
Load-bearing premise
The text-agnostic recall stage can reliably narrow the search to a small candidate set that still contains the correct fine-grained matches without discarding relevant items.
What would settle it
On standard benchmarks such as RSICD or RSITMD, count how many ground-truth matches are missing from the candidate set returned by the recall stage alone; a high miss rate would show the assumption fails.
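The check described above could be sketched as follows, assuming the recall stage exposes a query-by-image matrix of coarse similarity scores and that ground truth is given as one set of image indices per text query. The function name `recall_stage_miss_rate` and the data layout are hypothetical.

```python
import numpy as np

def recall_stage_miss_rate(scores, ground_truth, k):
    """Fraction of queries whose top-k coarse candidates contain no ground-truth image.

    scores:       (Q, N) array of coarse text-to-image similarities.
    ground_truth: sequence of Q sets of relevant image indices.
    k:            candidate-pool size passed on to the rerank stage.
    """
    k = min(k, scores.shape[1])
    # Indices of the k highest-scoring images per query (order within top-k is irrelevant).
    topk = np.argpartition(-scores, k - 1, axis=1)[:, :k]
    missed = sum(
        1 for q, relevant in enumerate(ground_truth)
        if not set(topk[q].tolist()) & set(relevant)
    )
    return missed / len(ground_truth)
```

Sweeping `k` over several pool sizes and plotting the miss rate against it would show how small the candidate set can be before correct matches start to be discarded.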
Original abstract
Remote sensing (RS) image-text retrieval plays a critical role in understanding massive RS imagery. However, the dense multi-object distribution and complex backgrounds in RS imagery make it difficult to simultaneously achieve fine-grained cross-modal alignment and efficient retrieval. Existing methods either rely on complex cross-modal interactions that lead to low retrieval efficiency, or depend on large-scale vision-language model pre-training, which requires massive data and computational resources. To address these issues, we propose a fast-then-fine (FTF) two-stage retrieval framework that decomposes retrieval into a text-agnostic recall stage for efficient candidate selection and a text-guided rerank stage for fine-grained alignment. Specifically, in the recall stage, text-agnostic coarse-grained representations are employed for efficient candidate selection; in the rerank stage, a parameter-free balanced text-guided interaction block enhances fine-grained alignment without introducing additional learnable parameters. Furthermore, an inter- and intra-modal loss is designed to jointly optimize cross-modal alignment across multi-granular representations. Extensive experiments on public benchmarks demonstrate that the FTF achieves competitive retrieval accuracy while significantly improving retrieval efficiency compared with existing methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a Fast-then-Fine (FTF) two-stage framework for cross-modal image-text retrieval in remote sensing. It decomposes retrieval into a text-agnostic recall stage using coarse-grained representations for efficient candidate selection, followed by a text-guided rerank stage employing a parameter-free balanced interaction block for fine-grained alignment. An inter- and intra-modal loss jointly optimizes multi-granular representations. The authors assert that extensive experiments on public benchmarks demonstrate competitive retrieval accuracy together with significant efficiency gains over prior methods.
Significance. If the experimental claims are substantiated, the work would be significant for remote sensing applications, where dense multi-object scenes demand both fine-grained cross-modal alignment and scalable retrieval. The separation into recall and rerank stages, combined with the explicitly parameter-free reranking block, offers a practical route to efficiency without large-scale pre-training or heavy interaction modules. The multi-granular loss provides a clean way to supervise representations at different levels of granularity.
major comments (2)
- [Experiments] The headline claim of competitive accuracy plus large efficiency gains rests on the text-agnostic recall stage reliably retrieving a small candidate pool that still contains essentially all correct fine-grained matches. The manuscript provides no explicit recall@K numbers for the first stage, no analysis of missed ground-truth pairs, and no ablation relating candidate-set size to final mean recall (mR). Without these data the central efficiency-accuracy tradeoff cannot be evaluated (Experiments section and associated tables).
- [Method] The rerank stage is described as using a 'parameter-free balanced text-guided interaction block.' It is unclear from the method description how the balancing mechanism is realized without introducing any learnable parameters or implicit fitting; the claim requires an explicit accounting of all operations and parameters in that block (Method section, rerank-stage subsection).
minor comments (2)
- [Abstract] The abstract asserts 'extensive experiments' yet supplies no quantitative highlights, error bars, or baseline comparisons; adding one or two key numbers would improve readability.
- [Method] Notation for the coarse- and fine-grained representations and the inter-/intra-modal loss terms would benefit from a compact table or diagram summarizing the multi-granular components.
Simulated Author's Rebuttal
We thank the referee for the constructive comments and for recognizing the potential significance of the FTF framework. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and additional analyses.
- Referee: [Experiments] The headline claim of competitive accuracy plus large efficiency gains rests on the text-agnostic recall stage reliably retrieving a small candidate pool that still contains essentially all correct fine-grained matches. The manuscript provides no explicit recall@K numbers for the first stage, no analysis of missed ground-truth pairs, and no ablation relating candidate-set size to final mean recall (mR). Without these data the central efficiency-accuracy tradeoff cannot be evaluated (Experiments section and associated tables).
  Authors: We agree that explicit quantification of the recall stage performance is necessary to fully substantiate the efficiency-accuracy tradeoff. In the revised manuscript we will add, in the Experiments section, recall@K results for the text-agnostic recall stage at multiple candidate-pool sizes, a breakdown of any ground-truth pairs missed by the first stage, and an ablation table showing the effect of candidate-set size on final mR. These additions will be supported by the existing experimental setup and will allow direct evaluation of the claimed tradeoff.
  Revision: yes
- Referee: [Method] The rerank stage is described as using a 'parameter-free balanced text-guided interaction block.' It is unclear from the method description how the balancing mechanism is realized without introducing any learnable parameters or implicit fitting; the claim requires an explicit accounting of all operations and parameters in that block (Method section, rerank-stage subsection).
  Authors: We acknowledge that the current description of the parameter-free balanced text-guided interaction block lacks sufficient detail. In the revised manuscript we will expand the rerank-stage subsection to provide a complete, step-by-step accounting of every operation. The balancing mechanism is implemented via fixed, non-learnable operations consisting of text-guided cross-attention followed by element-wise averaging with the original features and a static normalization factor; no trainable weights, implicit fitting, or additional parameters are introduced beyond the base encoders. The revised text will include the exact mathematical formulations and a parameter-count verification to confirm the block remains parameter-free.
  Revision: yes
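One possible reading of the block described in this simulated response is sketched below. It is an interpretation, not the authors' formulation: it assumes image patches act as queries and text tokens as both keys and values with identity projections, takes the "static normalization factor" to be the usual 1/sqrt(d) attention scale, and takes "element-wise averaging" to be an equal-weight mean. The name `text_guided_block` is hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def text_guided_block(patches, tokens):
    """Parameter-free text-guided refinement of image patch features.

    patches: (P, D) image patch features from the frozen or shared encoder.
    tokens:  (T, D) text token features.
    Returns: (P, D) features; no trainable weights are involved.
    """
    d = patches.shape[-1]
    # Each patch attends over the words; the scale is a fixed constant (assumed).
    attn = softmax(patches @ tokens.T / np.sqrt(d), axis=-1)  # (P, T)
    attended = attn @ tokens                                   # (P, D)
    # Equal-weight average keeps the original visual signal (assumed "balance").
    return 0.5 * (patches + attended)
```

Because every operation here is a fixed function of its inputs, a parameter count of the block is zero by construction, which is the property the referee asks the authors to verify for their actual design.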
Circularity Check
No circularity: independent engineering proposal with no self-referential derivations
The paper proposes an FTF two-stage framework that decomposes retrieval into a text-agnostic recall stage using coarse-grained representations and a text-guided rerank stage with a parameter-free balanced interaction block, plus an inter- and intra-modal loss for multi-granular alignment. No equations, derivations, or self-citations appear in the abstract or described method that reduce the claimed efficiency-accuracy tradeoff to fitted parameters renamed as predictions, self-definitional constructs, or load-bearing prior work by the same authors. Performance is asserted via experiments on public benchmarks rather than by construction from inputs. This is a standard self-contained engineering contribution.