pith. machine review for the scientific record.

arxiv: 2604.10721 · v1 · submitted 2026-04-12 · 💻 cs.CV · cs.AI

Recognition: unknown

Turning Generators into Retrievers: Unlocking MLLMs for Natural Language-Guided Geo-Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 15:34 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords natural language guided geo-localization · multimodal large language models · cross-view retrieval · parameter-efficient finetuning · satellite image retrieval · text-to-image matching

The pith

MLLMs can be adapted for natural language guided geo-localization through parameter-efficient fine-tuning to achieve state-of-the-art retrieval performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how multimodal large language models can be repurposed to retrieve geo-tagged satellite images from textual descriptions of ground scenes. Standard dual-encoder methods like CLIP often lack strong generalization and demand elaborate designs. The approach uses parameter-efficient tuning to refine internal representations in an MLLM without losing its original multimodal abilities, creating effective text-image alignment. This delivers better benchmark results using far fewer trainable parameters than prior work.

Core claim

Optimizing latent representations within the MLLM while preserving its pretrained multimodal knowledge enables strong cross-modal alignment for natural-language guided cross-view geo-localization without redesigning model architectures.

What carries the argument

Parameter-efficient finetuning that optimizes latent representations in MLLMs while preserving pretrained multimodal knowledge.
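
This abstract-only view does not specify the training objective, but the standard way to optimize latent representations for cross-modal retrieval is a symmetric contrastive (InfoNCE) loss over pooled text and image embeddings. A minimal sketch under that assumption; every name here is illustrative, not the paper's code:

```python
import torch
import torch.nn.functional as F

def info_nce(text_emb: torch.Tensor, image_emb: torch.Tensor,
             temperature: float = 0.07) -> torch.Tensor:
    """Symmetric contrastive loss over a batch of matched (text, satellite-image) pairs."""
    text_emb = F.normalize(text_emb, dim=-1)
    image_emb = F.normalize(image_emb, dim=-1)
    logits = text_emb @ image_emb.t() / temperature  # (B, B) cosine similarities
    targets = torch.arange(logits.size(0), device=logits.device)
    # Row i's positive is column i: each text matches the image at the same batch index.
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.t(), targets))
```

Under this kind of objective, only whichever adapter parameters feed `text_emb` and `image_emb` need gradients; the backbone can stay frozen, which is the parameter-efficiency claim in miniature.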

If this is right

  • MLLMs become a viable and scalable base for semantic cross-view retrieval instead of dual-encoder architectures.
  • Strong cross-modal alignment is possible without complex new model designs.
  • Performance improvements on GeoText-1652 and CVG-Text occur with substantially fewer trainable parameters.
  • Systematic variation of backbone and aggregation choices yields reusable guidelines for MLLM use in retrieval (see the aggregation sketch after this list).
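
One concrete axis such a sweep would cover is feature aggregation: how an MLLM's per-token hidden states become a single retrieval embedding. A minimal sketch of two common variants, last-token and masked-mean pooling; the helper and its interface are hypothetical, not taken from the paper:

```python
import torch

def aggregate(hidden: torch.Tensor, mask: torch.Tensor, mode: str = "last") -> torch.Tensor:
    """hidden: (B, T, D) last-layer token states; mask: (B, T), 1 for real tokens."""
    if mode == "last":
        # Embedding of the final non-padding token (common for decoder-only LMs).
        idx = mask.long().sum(dim=1) - 1                       # (B,)
        return hidden[torch.arange(hidden.size(0)), idx]
    if mode == "mean":
        # Masked mean over the sequence.
        denom = mask.sum(dim=1, keepdim=True).clamp(min=1)
        return (hidden * mask.unsqueeze(-1)).sum(dim=1) / denom
    raise ValueError(f"unknown aggregation mode: {mode}")
```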

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same tuning pattern may transfer to other text-guided retrieval settings such as medical or aerial video matching.
  • Fewer parameters could lower the barrier for deploying retrieval systems in resource-limited environments.
  • Preserving general knowledge during adaptation may improve handling of ambiguous or complex scene descriptions.

Load-bearing premise

That optimizing latent representations within the MLLM while preserving its pretrained multimodal knowledge enables strong cross-modal alignment without redesigning model architectures.

What would settle it

A direct comparison on a new held-out NGCG benchmark showing whether the adapted MLLM achieves better or worse Recall@1 than a standard CLIP dual-encoder baseline.
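
For concreteness, the Recall@K number such a head-to-head would report can be computed as below; `sim` and `gt` are hypothetical stand-ins for the query-gallery similarity matrix and ground-truth indices of whichever models are being compared:

```python
import torch

def recall_at_k(sim: torch.Tensor, gt: torch.Tensor, k: int = 1) -> float:
    """sim: (Q, G) text-query x satellite-gallery similarities;
    gt: (Q,) index of each query's ground-truth gallery image."""
    topk = sim.topk(k, dim=1).indices            # (Q, k) highest-scoring gallery ids
    hits = (topk == gt.unsqueeze(1)).any(dim=1)  # is the true match among the top-k?
    return hits.float().mean().item()
```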

Figures

Figures reproduced from arXiv: 2604.10721 by Ahmad Arrabi, Chen Chen, Safwan Wshah, Waqas Sultani, Xiaohan Zhang, Yuqi Chen.

Figure 1. Comparison between the CLIP-style Dual-Encoder Ar…
Figure 2. The framework leverages a pre-trained MLLM for fea…
Figure 3. Visualization of Text-to-Satellite Image Retrieval Results on CVG-Text. Four text-satellite pairs are shown. Left is the query text.
original abstract

Natural-language Guided Cross-view Geo-localization (NGCG) aims to retrieve geo-tagged satellite imagery using textual descriptions of ground scenes. While recent NGCG methods commonly rely on CLIP-style dual-encoder architectures, they often suffer from weak cross-modal generalization and require complex architectural designs. In contrast, Multimodal Large Language Models (MLLMs) offer powerful semantic reasoning capabilities but are not directly optimized for retrieval tasks. In this work, we present a simple yet effective framework to adapt MLLMs for NGCG via parameter-efficient finetuning. Our approach optimizes latent representations within the MLLM while preserving its pretrained multimodal knowledge, enabling strong cross-modal alignment without redesigning model architectures. Through systematic analysis of diverse variables, from model backbone to feature aggregation, we provide practical and generalizable insights for leveraging MLLMs in NGCG. Our method achieves SOTA on GeoText-1652 with a 12.2% improvement in Text-to-Image Recall@1 and secures top performance in 5 out of 12 subtasks on CVG-Text, all while surpassing baselines with far fewer trainable parameters. These results position MLLMs as a robust foundation for semantic cross-view retrieval and pave the way for MLLM-based NGCG to be adopted as a scalable, powerful alternative to traditional dual-encoder designs. Project page and code are available at https://yuqichen888.github.io/NGCG-MLLMs-web/.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 0 minor

Summary. The paper proposes adapting Multimodal Large Language Models (MLLMs) for Natural Language-Guided Cross-view Geo-localization (NGCG) via parameter-efficient fine-tuning that optimizes latent representations while preserving pretrained multimodal knowledge. This enables cross-modal alignment for text-to-image retrieval of geo-tagged satellite imagery without architecture redesign. Systematic analyses cover model backbones and feature aggregation. The method reports SOTA on GeoText-1652 (12.2% Text-to-Image Recall@1 gain) and top performance in 5/12 subtasks on CVG-Text, using far fewer trainable parameters than dual-encoder baselines.

Significance. If the preservation of pretrained knowledge holds and the gains prove robust, the work would be significant for shifting NGCG from CLIP-style dual encoders toward more semantically capable MLLM foundations. The parameter-efficient approach and public code release (noted in the abstract) are practical strengths that support reproducibility and generalizability. The systematic variable analysis offers concrete insights for future MLLM-based retrieval.

major comments (1)
  1. [Abstract and §3 (Method)] The central claim that parameter-efficient fine-tuning 'preserves its pretrained multimodal knowledge' (and that this retained semantic reasoning is what powers retrieval) lacks supporting evidence. No before/after evaluations on held-out general multimodal tasks (VQA, captioning, or visual reasoning) are reported; all metrics are NGCG-specific (GeoText-1652 Recall@1 and CVG-Text subtasks). This is load-bearing, as the superiority over dual-encoder baselines is explicitly attributed to retention of pretrained capabilities rather than task-specific specialization.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and for identifying this important point about supporting evidence for our central claim. We respond point by point below.

point-by-point responses
  1. Referee: [Abstract and §3 (Method)] The central claim that parameter-efficient fine-tuning 'preserves its pretrained multimodal knowledge' (and that this retained semantic reasoning is what powers retrieval) lacks supporting evidence. No before/after evaluations on held-out general multimodal tasks (VQA, captioning, or visual reasoning) are reported; all metrics are NGCG-specific (GeoText-1652 Recall@1 and CVG-Text subtasks). This is load-bearing, as the superiority over dual-encoder baselines is explicitly attributed to retention of pretrained capabilities rather than task-specific specialization.

    Authors: We appreciate the referee highlighting the need for stronger substantiation of the knowledge-preservation claim. Our method applies parameter-efficient fine-tuning (LoRA-style adaptation) that freezes the overwhelming majority of the MLLM parameters and updates only a small subset. This design is explicitly motivated by the desire to retain pretrained multimodal reasoning while adapting the latent space for retrieval. Although we do not report before/after scores on VQA or captioning, the 12.2% Recall@1 improvement over dual-encoder baselines—which lack access to the same rich semantic priors—provides indirect evidence that the pretrained capabilities are being leveraged rather than overwritten by task-specific specialization. In the revised manuscript we will expand §3 with additional discussion of the PEFT literature on knowledge retention and will add a clarifying paragraph in the abstract and method sections to better separate the design rationale from the empirical results. revision: yes
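
To make the "LoRA-style adaptation" in the response concrete: a minimal sketch of a low-rank adapter under standard LoRA conventions (frozen base weight, trainable rank-r factors, alpha/r scaling). It illustrates the general technique the rebuttal invokes, not the paper's actual implementation:

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    """Wraps a pretrained linear layer; only the low-rank A/B factors train."""
    def __init__(self, base: nn.Linear, r: int = 8, alpha: int = 16):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze the pretrained weight and bias
        self.A = nn.Parameter(torch.randn(r, base.in_features) * 0.01)
        # B starts at zero, so the adapted layer initially equals the base layer.
        self.B = nn.Parameter(torch.zeros(base.out_features, r))
        self.scale = alpha / r

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # y = base(x) + scale * (x A^T) B^T : a rank-r correction to the output.
        return self.base(x) + (x @ self.A.t() @ self.B.t()) * self.scale
```

Wrapping, say, each attention projection this way leaves the pretrained weights untouched while training only r × (in_features + out_features) extra parameters per wrapped layer, which is the sense in which the rebuttal says most of the MLLM stays frozen.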

Circularity Check

0 steps flagged

No circularity: empirical adaptation study with independent experimental results

full rationale

The paper presents an empirical framework for adapting MLLMs to NGCG via parameter-efficient finetuning, reporting SOTA Recall@1 gains on GeoText-1652 and CVG-Text subtasks. No equations, predictions, or first-principles derivations appear that reduce reported metrics to quantities defined by the paper's own fitted parameters or self-referential definitions. The approach is described as optimizing latent representations while preserving pretrained knowledge, but this is an empirical claim evaluated directly on task-specific benchmarks rather than a tautological construction. No load-bearing self-citations, uniqueness theorems, or ansatzes imported from prior author work are invoked to force the central result. The derivation chain is self-contained as a standard adaptation experiment.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only view prevents enumeration of exact hyperparameters or training choices; the approach implicitly relies on standard assumptions of parameter-efficient tuning preserving pretrained capabilities.

pith-pipeline@v0.9.0 · 5584 in / 1033 out tokens · 62571 ms · 2026-05-10T15:34:09.918691+00:00 · methodology

discussion (0)

