Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization
Pith reviewed 2026-05-10 15:34 UTC · model grok-4.3
The pith
MLLMs can be adapted for natural language-guided geo-localization through parameter-efficient fine-tuning to achieve state-of-the-art retrieval performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Optimizing latent representations within the MLLM while preserving its pretrained multimodal knowledge enables strong cross-modal alignment for natural language-guided cross-view geo-localization without redesigning model architectures.
What carries the argument
Parameter-efficient finetuning that optimizes latent representations in MLLMs while preserving pretrained multimodal knowledge.
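The parameter-efficient pattern described here can be sketched as a LoRA-style low-rank update on top of a frozen weight. This is a minimal illustration only: the dimensions, rank, and zero-initialization below are standard LoRA conventions, not the paper's actual configuration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Illustrative dimensions; the paper's backbone size and rank are not given here.
d, r = 64, 4

# Frozen pretrained projection (stands in for one MLLM weight matrix).
W = rng.standard_normal((d, d)) / np.sqrt(d)

# Trainable low-rank adapters: only A and B are updated during finetuning,
# so the pretrained weight W is preserved exactly.
A = rng.standard_normal((d, r)) * 0.01
B = np.zeros((r, d))  # zero init: the adapted model starts identical to the base

def adapted_forward(x):
    # Base path plus low-rank update, as in LoRA-style adaptation.
    return x @ W + x @ A @ B

x = rng.standard_normal((2, d))
# With B = 0, the adapted output equals the frozen base output.
assert np.allclose(adapted_forward(x), x @ W)

# Trainable fraction: 2*d*r adapter parameters vs d*d frozen parameters.
trainable_frac = (A.size + B.size) / W.size
print(f"trainable fraction: {trainable_frac:.3f}")
```

With these toy sizes the adapters are 12.5% of one layer's parameters; in practice the fraction across a full MLLM is far smaller, which is what "far fewer trainable parameters" in the abstract refers to.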
If this is right
- MLLMs become a viable and scalable base for semantic cross-view retrieval instead of dual-encoder architectures.
- Strong cross-modal alignment is possible without complex new model designs.
- Performance improvements on GeoText-1652 and CVG-Text occur with substantially fewer trainable parameters.
- Systematic variation of backbone and aggregation choices yields reusable guidelines for MLLM use in retrieval.
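The aggregation choices mentioned above can be made concrete with a short sketch. The two strategies below (mean pooling vs. last-token readout) are common options for turning a causal LM's token states into one retrieval embedding; they are illustrative assumptions, not necessarily the variants the paper compares.

```python
import numpy as np

rng = np.random.default_rng(2)

# Token-level hidden states from a decoder, shape (seq_len, hidden_dim).
# Synthetic stand-in data; a real run would use the MLLM's activations.
tokens = rng.standard_normal((10, 16))

def mean_pool(h):
    # Average all token states into a single embedding.
    return h.mean(axis=0)

def last_token(h):
    # Read out the final token's state (common for causal LMs,
    # where the last position attends to the whole sequence).
    return h[-1]

emb_mean, emb_last = mean_pool(tokens), last_token(tokens)
assert emb_mean.shape == emb_last.shape == (16,)
```

Systematically swapping such aggregators (and the backbone producing `tokens`) is the kind of variable sweep the review credits with yielding reusable guidelines.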
Where Pith is reading between the lines
- The same tuning pattern may transfer to other text-guided retrieval settings such as medical or aerial video matching.
- Fewer parameters could lower the barrier for deploying retrieval systems in resource-limited environments.
- Preserving general knowledge during adaptation may improve handling of ambiguous or complex scene descriptions.
Load-bearing premise
That optimizing latent representations within the MLLM while preserving its pretrained multimodal knowledge enables strong cross-modal alignment without redesigning model architectures.
What would settle it
A direct comparison on a new held-out NGCG benchmark: if the adapted MLLM achieves Recall@1 no better than (or worse than) a standard CLIP dual-encoder baseline, the claim fails.
Original abstract
Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.
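The abstract's headline metric, Text-to-Image Recall@1, can be sketched as follows. The embeddings here are synthetic stand-ins where query i is a noisy copy of gallery image i; real features would come from the adapted MLLM and the satellite-image gallery.

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy features: 5 text queries and 5 satellite-image gallery embeddings,
# constructed so that query i matches image i.
n, d = 5, 8
img = rng.standard_normal((n, d))
txt = img + 0.05 * rng.standard_normal((n, d))

def recall_at_1(queries, gallery):
    # L2-normalise, rank the gallery by cosine similarity,
    # and count how often the top-1 result is the true match.
    q = queries / np.linalg.norm(queries, axis=1, keepdims=True)
    g = gallery / np.linalg.norm(gallery, axis=1, keepdims=True)
    top1 = (q @ g.T).argmax(axis=1)
    return (top1 == np.arange(len(queries))).mean()

print(recall_at_1(txt, img))  # near-duplicate queries -> 1.0 on this toy data
```

The reported "12.2% improvement in Text-to-Image Recall@1" is this quantity measured on GeoText-1652, not on synthetic data.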
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes adapting Multimodal Large Language Models (MLLMs) for Natural Language-Guided Cross-view Geo-localization (NGCG) via parameter-efficient fine-tuning that optimizes latent representations while preserving pretrained multimodal knowledge. This enables cross-modal alignment for text-to-image retrieval of geo-tagged satellite imagery without architecture redesign. Systematic analyses cover model backbones and feature aggregation. The method reports SOTA on GeoText-1652 (12.2% Text-to-Image Recall@1 gain) and top performance in 5/12 subtasks on CVG-Text, using far fewer trainable parameters than dual-encoder baselines.
Significance. If the preservation of pretrained knowledge holds and the gains prove robust, the work would be significant for shifting NGCG from CLIP-style dual encoders toward more semantically capable MLLM foundations. The parameter-efficient approach and public code release (noted in the abstract) are practical strengths that support reproducibility and generalizability. The systematic variable analysis offers concrete insights for future MLLM-based retrieval.
Major comments (1)
- [Abstract and §3] Abstract and §3 (Method): The central claim that parameter-efficient fine-tuning 'preserves its pretrained multimodal knowledge' (enabling leverage of semantic reasoning for retrieval) lacks supporting evidence. No before/after evaluations on held-out general multimodal tasks (VQA, captioning, or visual reasoning) are reported; all metrics are NGCG-specific (GeoText-1652 Recall@1 and CVG-Text subtasks). This is load-bearing, as the superiority to dual-encoder baselines is explicitly attributed to retention of pretrained capabilities rather than task-specific specialization.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying this important point about supporting evidence for our central claim. We respond point by point below.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that parameter-efficient fine-tuning 'preserves its pretrained multimodal knowledge' (enabling leverage of semantic reasoning for retrieval) lacks supporting evidence. No before/after evaluations on held-out general multimodal tasks (VQA, captioning, or visual reasoning) are reported; all metrics are NGCG-specific (GeoText-1652 Recall@1 and CVG-Text subtasks). This is load-bearing, as the superiority to dual-encoder baselines is explicitly attributed to retention of pretrained capabilities rather than task-specific specialization.
Authors: We appreciate the referee highlighting the need for stronger substantiation of the knowledge-preservation claim. Our method applies parameter-efficient fine-tuning (LoRA-style adaptation) that freezes the overwhelming majority of the MLLM parameters and updates only a small subset. This design is explicitly motivated by the desire to retain pretrained multimodal reasoning while adapting the latent space for retrieval. Although we do not report before/after scores on VQA or captioning, the 12.2% Recall@1 improvement over dual-encoder baselines—which lack access to the same rich semantic priors—provides indirect evidence that the pretrained capabilities are being leveraged rather than overwritten by task-specific specialization. In the revised manuscript we will expand §3 with additional discussion of the PEFT literature on knowledge retention and will add a clarifying paragraph in the abstract and method sections to better separate the design rationale from the empirical results. revision: yes
Circularity Check
No circularity: empirical adaptation study with independent experimental results
Full rationale
The paper presents an empirical framework for adapting MLLMs to NGCG via parameter-efficient finetuning, reporting SOTA Recall@1 gains on GeoText-1652 and CVG-Text subtasks. No equations, predictions, or first-principles derivations appear that reduce reported metrics to quantities defined by the paper's own fitted parameters or self-referential definitions. The approach is described as optimizing latent representations while preserving pretrained knowledge, but this is an empirical claim evaluated directly on task-specific benchmarks rather than a tautological construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central result. The derivation chain is self-contained as a standard adaptation experiment.