pith. machine review for the scientific record.

arxiv: 2604.10721 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords natural language guided geo-localization · multimodal large language models · cross-view retrieval · parameter-efficient finetuning · satellite image retrieval · text-to-image matching

The pith

MLLMs can be adapted for natural language guided geo-localization through parameter-efficient fine-tuning to achieve state-of-the-art retrieval performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how multimodal large language models can be repurposed to retrieve geo-tagged satellite images from textual descriptions of ground scenes. Standard dual-encoder methods like CLIP often lack strong generalization and demand elaborate designs. The approach uses parameter-efficient tuning to refine internal representations in an MLLM without losing its original multimodal abilities, creating effective text-image alignment. This delivers better benchmark results using far fewer trainable parameters than prior work.

Core claim

Optimizing latent representations within the MLLM while preserving its pretrained multimodal knowledge enables strong cross-modal alignment for natural-language guided cross-view geo-localization without redesigning model architectures.

What carries the argument

Parameter-efficient finetuning that optimizes latent representations in MLLMs while preserving pretrained multimodal knowledge.
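
This abstract-only view does not specify the training objective, but the standard way to optimize latent representations for cross-modal retrieval is a symmetric contrastive (InfoNCE) loss over pooled text and image embeddings. A minimal sketch under that assumption; every name here is illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, image_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched (text, satellite-image) pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Row i's positive is column i: each text matches the image at the same batch index.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Under this kind of objective, only whichever adapter parameters feed `text_emb` and `image_emb` need gradients; the backbone can stay frozen, which is the parameter-efficiency claim in miniature.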

If this is right

  • MLLMs become a viable and scalable base for semantic cross-view retrieval instead of dual-encoder architectures.
  • Strong cross-modal alignment is possible without complex new model designs.
  • Performance improvements on GeoText-1652 and CVG-Text occur with substantially fewer trainable parameters.
  • Systematic variation of backbone and aggregation choices yields reusable guidelines for MLLM use in retrieval (see the aggregation sketch after this list).
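
One concrete axis such a sweep would cover is feature aggregation: how an MLLM's per-token hidden states become a single retrieval embedding. A minimal sketch of two common variants, last-token and masked-mean pooling; the helper and its interface are hypothetical, not taken from the paper:

```python
import torch

def aggregate(hidden: torch.Tensor, mask: torch.Tensor, mode: str = "last") -> torch.Tensor:
    """hidden: (B, T, D) last-layer token states; mask: (B, T), 1 for real tokens."""
    if mode == "last":
        # Embedding of the final non-padding token (common for decoder-only LMs).
        idx = mask.long().sum(dim=1) - 1                       # (B,)
        return hidden[torch.arange(hidden.size(0)), idx]
    if mode == "mean":
        # Masked mean over the sequence.
        denom = mask.sum(dim=1, keepdim=True).clamp(min=1)
        return (hidden * mask.unsqueeze(-1)).sum(dim=1) / denom
    raise ValueError(f"unknown aggregation mode: {mode}")
```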

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tuning pattern may transfer to other text-guided retrieval settings such as medical or aerial video matching.
  • Fewer parameters could lower the barrier for deploying retrieval systems in resource-limited environments.
  • Preserving general knowledge during adaptation may improve handling of ambiguous or complex scene descriptions.

Load-bearing premise

That optimizing latent representations within the MLLM while preserving its pretrained multimodal knowledge enables strong cross-modal alignment without redesigning model architectures.

What would settle it

A direct comparison on a new held-out NGCG benchmark showing whether the adapted MLLM achieves better or worse Recall@1 than a standard CLIP dual-encoder baseline.
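
For concreteness, the Recall@K number such a head-to-head would report can be computed as below; `sim` and `gt` are hypothetical stand-ins for the query-gallery similarity matrix and ground-truth indices of whichever models are being compared:

```python
import torch

def recall_at_k(sim: torch.Tensor, gt: torch.Tensor, k: int = 1) -> float:
    """sim: (Q, G) text-query x satellite-gallery similarities;
    gt: (Q,) index of each query's ground-truth gallery image."""
    topk = sim.topk(k, dim=1).indices            # (Q, k) highest-scoring gallery ids
    hits = (topk == gt.unsqueeze(1)).any(dim=1)  # is the true match among the top-k?
    return hits.float().mean().item()
```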

Figures

Figures reproduced from arXiv: 2604.10721 by Ahmad Arrabi, Chen Chen, Safwan Wshah, Waqas Sultani, Xiaohan Zhang, Yuqi Chen.

Figure 1. Comparison between the CLIP-style Dual-Encoder Ar…
Figure 2. The framework leverages a pre-trained MLLM for fea…
Figure 3. Visualization of Text-to-Satellite Image Retrieval Results on CVG-Text. Four text-satellite pairs are shown. Left is the query text.
original abstract

Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes adapting Multimodal Large Language Models (MLLMs) for Natural Language-Guided Cross-view Geo-localization (NGCG) via parameter-efficient fine-tuning that optimizes latent representations while preserving pretrained multimodal knowledge. This enables cross-modal alignment for text-to-image retrieval of geo-tagged satellite imagery without architecture redesign. Systematic analyses cover model backbones and feature aggregation. The method reports SOTA on GeoText-1652 (12.2% Text-to-Image Recall@1 gain) and top performance in 5/12 subtasks on CVG-Text, using far fewer trainable parameters than dual-encoder baselines.

Significance. If the preservation of pretrained knowledge holds and the gains prove robust, the work would be significant for shifting NGCG from CLIP-style dual encoders toward more semantically capable MLLM foundations. The parameter-efficient approach and public code release (noted in the abstract) are practical strengths that support reproducibility and generalizability. The systematic variable analysis offers concrete insights for future MLLM-based retrieval.

major comments (1)
  1. [Abstract and §3 (Method)] The central claim that parameter-efficient fine-tuning 'preserves its pretrained multimodal knowledge' (and that this retained semantic reasoning is what powers retrieval) lacks supporting evidence. No before/after evaluations on held-out general multimodal tasks (VQA, captioning, or visual reasoning) are reported; all metrics are NGCG-specific (GeoText-1652 Recall@1 and CVG-Text subtasks). This is load-bearing, as the superiority over dual-encoder baselines is explicitly attributed to retention of pretrained capabilities rather than task-specific specialization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying this important point about supporting evidence for our central claim. We respond point by point below.

point-by-point responses
  1. Referee: [Abstract and §3 (Method)] The central claim that parameter-efficient fine-tuning 'preserves its pretrained multimodal knowledge' (and that this retained semantic reasoning is what powers retrieval) lacks supporting evidence. No before/after evaluations on held-out general multimodal tasks (VQA, captioning, or visual reasoning) are reported; all metrics are NGCG-specific (GeoText-1652 Recall@1 and CVG-Text subtasks). This is load-bearing, as the superiority over dual-encoder baselines is explicitly attributed to retention of pretrained capabilities rather than task-specific specialization.

    Authors: We appreciate the referee highlighting the need for stronger substantiation of the knowledge-preservation claim. Our method applies parameter-efficient fine-tuning (LoRA-style adaptation) that freezes the overwhelming majority of the MLLM parameters and updates only a small subset. This design is explicitly motivated by the desire to retain pretrained multimodal reasoning while adapting the latent space for retrieval. Although we do not report before/after scores on VQA or captioning, the 12.2% Recall@1 improvement over dual-encoder baselines—which lack access to the same rich semantic priors—provides indirect evidence that the pretrained capabilities are being leveraged rather than overwritten by task-specific specialization. In the revised manuscript we will expand §3 with additional discussion of the PEFT literature on knowledge retention and will add a clarifying paragraph in the abstract and method sections to better separate the design rationale from the empirical results. revision: yes
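
To make the "LoRA-style adaptation" in the response concrete: a minimal sketch of a low-rank adapter under standard LoRA conventions (frozen base weight, trainable rank-r factors, alpha/r scaling). It illustrates the general technique the rebuttal invokes, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a pretrained linear layer; only the low-rank A/B factors train."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # B starts at zero, so the adapted layer initially equals the base layer.
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scale * (x A^T) B^T : a rank-r correction to the output.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```

Wrapping, say, each attention projection this way leaves the pretrained weights untouched while training only r × (in_features + out_features) extra parameters per wrapped layer, which is the sense in which the rebuttal says most of the MLLM stays frozen.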

Circularity Check

0 steps flagged

No circularity: empirical adaptation study with independent experimental results

full rationale

The paper presents an empirical framework for adapting MLLMs to NGCG via parameter-efficient finetuning, reporting SOTA Recall@1 gains on GeoText-1652 and CVG-Text subtasks. No equations, predictions, or first-principles derivations appear that reduce reported metrics to quantities defined by the paper's own fitted parameters or self-referential definitions. The approach is described as optimizing latent representations while preserving pretrained knowledge, but this is an empirical claim evaluated directly on task-specific benchmarks rather than a tautological construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central result. The derivation chain is self-contained as a standard adaptation experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view prevents enumeration of exact hyperparameters or training choices; the approach implicitly relies on standard assumptions of parameter-efficient tuning preserving pretrained capabilities.

pith-pipeline@v0.9.0 · 5584 in / 1033 out tokens · 62571 ms · 2026-05-10T15:34:09.918691+00:00 · methodology

discussion (0)

