GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Bo Du; Di Wang; En Wang; Haonan Guo; Jing Zhang; Wenbin Liu; Zidie Zhou; Zixuan Song

arxiv: 2512.02697 · v3 · submitted 2025-12-02 · 💻 cs.CV

GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization

Zixuan Song , Jing Zhang , Di Wang , Zidie Zhou , Wenbin Liu , Haonan Guo , En Wang , Bo Du This is my paper

Pith reviewed 2026-05-17 02:40 UTC · model grok-4.3

classification 💻 cs.CV

keywords cross-view geo-localizationsemantic-anchor mechanismmulti-view foundation modelimage-text alignmentGeoLoc datasetdrone street satellite viewslanguage-to-image retrieval

0 comments

The pith

GeoBridge uses a semantic-anchor mechanism to bridge multi-view images and text for robust geo-localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper sets out to establish that textual descriptions can serve as reliable semantic anchors to align features from drone, street-view, and satellite images, enabling bidirectional cross-view matching and language-to-image retrieval. This matters to a sympathetic reader because it relaxes the traditional dependence on always-available high-resolution satellite imagery and instead draws on complementary information across perspectives and modalities. The authors support the approach with a new large-scale dataset called GeoLoc that supplies aligned image pairs and descriptions from 36 countries. Experiments indicate that pre-training with this data raises geo-location accuracy while aiding generalization across domains and transfer between image and language modalities.

Core claim

GeoBridge is a foundation model that performs bidirectional matching across drone, street-view panorama, and satellite images while supporting language-to-image retrieval. It relies on a novel semantic-anchor mechanism that bridges multi-view visual features through shared textual descriptions. Pre-training on the newly constructed GeoLoc dataset, which contains over 50,000 aligned multi-view and text pairs from 36 countries, improves geo-location accuracy, cross-domain generalization, and cross-modal knowledge transfer.

What carries the argument

The semantic-anchor mechanism, which aligns multi-view visual features from drone, street-view, and satellite images by routing them through common textual descriptions.

If this is right

Pre-training on the GeoLoc dataset raises geo-location accuracy.
The approach improves cross-domain generalization across different geographic regions.
Cross-modal knowledge transfer occurs between language and image modalities.
Localization becomes possible in settings where up-to-date satellite imagery is unavailable.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Natural-language queries could replace image queries for finding locations in mapping systems.
The method may prove especially useful in regions where satellite updates are infrequent or costly.
Integrating live textual reports from users or sensors could further extend the model's utility.

Load-bearing premise

Textual descriptions can supply reliable semantic alignment across drone, street, and satellite views without major loss of spatial precision or injection of viewpoint-specific biases.

What would settle it

A controlled comparison in which geo-localization accuracy remains unchanged or drops when the semantic-anchor mechanism is removed and models rely only on direct visual matching between views.

Figures

Figures reproduced from arXiv: 2512.02697 by Bo Du, Di Wang, En Wang, Haonan Guo, Jing Zhang, Wenbin Liu, Zidie Zhou, Zixuan Song.

**Figure 1.** Figure 1: Schematic diagram of GeoBridge. Cross-view geo-location aims to match images with geo-referenced coordinates based on [PITH_FULL_IMAGE:figures/full_fig_p001_1.png] view at source ↗

**Figure 2.** Figure 2: Overall workflow. Left: multi-view data processing for the GeoLoc dataset. Right: the GeoBridge method. (a) Global distribution [PITH_FULL_IMAGE:figures/full_fig_p003_2.png] view at source ↗

**Figure 3.** Figure 3: Qualitative image retrieval results on the GeoLoc dataset. The red boxes indicate the true-matched images. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗

**Figure 4.** Figure 4: Qualitative results for cross-modal geo-location. Using [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Examples of original drone images. 13 [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

**Figure 6.** Figure 6: Examples of basic validity screening [PITH_FULL_IMAGE:figures/full_fig_p014_6.png] view at source ↗

**Figure 7.** Figure 7: Examples of blurry drone subimages. cover, or uneven illumination, as well as images with severe compression artifacts that lead to substantial detail loss. BH-Gate combines global pixel variance with an image sharpness measure to detect the absence of meaningful spatial detail. As illustrated in Fig.7, when an image exhibits extremely low texture variation, it is deemed to contain insufficient visual … view at source ↗

**Figure 8.** Figure 8: Examples of low global-contrast drone subimages [PITH_FULL_IMAGE:figures/full_fig_p015_8.png] view at source ↗

**Figure 9.** Figure 9: Examples of uiform-texture and noisy pseudo-texture drone subimages [PITH_FULL_IMAGE:figures/full_fig_p016_9.png] view at source ↗

**Figure 10.** Figure 10: Examples of aligned tri-view images. 17 [PITH_FULL_IMAGE:figures/full_fig_p017_10.png] view at source ↗

**Figure 11.** Figure 11: Tri-View instruction protocol for generating unified semantic descriptions. The blue text box denotes the instruction prompt; [PITH_FULL_IMAGE:figures/full_fig_p018_11.png] view at source ↗

**Figure 12.** Figure 12: Cross-modal geo-location drone image description instructions, with blue text boxes indicating the prompts. [PITH_FULL_IMAGE:figures/full_fig_p018_12.png] view at source ↗

**Figure 13.** Figure 13: Cross-modal geo-location street-panorama image description instructions, with blue text boxes indicating the prompts. [PITH_FULL_IMAGE:figures/full_fig_p019_13.png] view at source ↗

**Figure 14.** Figure 14: Cross-modal geo-location satellite image description instructions, with blue text boxes indicating the prompts. [PITH_FULL_IMAGE:figures/full_fig_p019_14.png] view at source ↗

**Figure 15.** Figure 15: Qualitative results for cross-modal geo-location. Using street view descriptions to match satellite perspectives, the top three [PITH_FULL_IMAGE:figures/full_fig_p021_15.png] view at source ↗

**Figure 16.** Figure 16: Qualitative results for cross-modal geo-location. Using satellite view descriptions to match drone perspectives, the top three [PITH_FULL_IMAGE:figures/full_fig_p022_16.png] view at source ↗

**Figure 17.** Figure 17: Qualitative results for cross-modal geo-location. Using drone view descriptions to match street perspectives, the top three results [PITH_FULL_IMAGE:figures/full_fig_p023_17.png] view at source ↗

read the original abstract

Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (\eg, drone, satellite, and street) and modalities (\eg, language and image). To address these challenges, we propose GeoBridge, a novel model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. Code, dataset, and pretrained models will be released at https://github.com/MiliLab/GeoBridge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

GeoBridge brings a new multi-view dataset and a text-anchor idea for geo-localization, but the abstract gives no numbers so the gains are still unproven.

read the letter

The paper's clearest move is releasing GeoLoc, a dataset of more than 50,000 aligned drone, street-view, and satellite images with text descriptions collected across 36 countries. That scale and the cross-modal pairing are new relative to most prior geo-localization work. They also describe GeoBridge, which routes multi-view image features through a shared text space as a semantic anchor instead of depending only on satellite imagery. The claim is that this setup improves accuracy, cross-domain generalization, and language-to-image retrieval after pre-training on the new data. Releasing the code, dataset, and models is straightforward and helpful for anyone who wants to test the approach directly. The dataset construction itself looks like real work; getting consistent geographic and semantic alignment at that volume is not automatic. The semantic-anchor step is the part that needs the most scrutiny. Text descriptions tend to highlight categories and appearance while dropping precise metric relations and fine geometry. If the model projects features into that text space before matching, any loss of spatial detail could hurt disambiguation between nearby locations. The abstract states that experiments confirm marked improvements, yet it supplies no quantitative results, baselines, or ablation numbers. Without those details it is difficult to judge whether the reported gains come from the anchor mechanism, the new data, or something else. This work is aimed at groups already working on cross-view matching, remote-sensing foundation models, or language-guided localization. A reader who needs a large aligned multi-view corpus or who wants to experiment with text as an intermediate representation could extract value from the dataset even before the model is fully validated. The paper shows honest engagement with the satellite-dependency problem and the limits of single-view approaches. It deserves a serious referee to check the experimental setup and the actual performance numbers.

Referee Report

2 major / 2 minor

Summary. The paper proposes GeoBridge, a multi-view foundation model for cross-view geo-localization that introduces a semantic-anchor mechanism to bridge drone, street-view, and satellite image features via textual descriptions, enabling bidirectional image matching and language-to-image retrieval. It also presents the GeoLoc dataset of over 50,000 aligned multi-view image-text pairs collected from 36 countries. The authors claim that pre-training on GeoLoc yields marked improvements in geo-location accuracy, cross-domain generalization, and cross-modal knowledge transfer.

Significance. If the semantic-anchor mechanism proves effective without substantial degradation in spatial precision, the work would advance geo-localization by reducing dependence on high-resolution satellite imagery and supporting flexible multi-view and cross-modal tasks. The scale and geographic diversity of the released GeoLoc dataset, along with code and pretrained models, would provide a valuable benchmark resource for the community.

major comments (2)

[Abstract] Abstract: the claim that 'experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy' is unsupported by any quantitative results, baselines, ablation studies, or implementation details in the provided text, preventing verification of the central performance claims.
[Method] Method section on semantic-anchor mechanism: the assumption that textual descriptions reliably bridge multi-view features without loss of fine-grained spatial details or viewpoint biases is load-bearing for both novelty and claimed gains, yet natural language is lossy for metric relations and geometries needed to disambiguate locations; targeted ablations comparing localization error with and without the anchor step are required to substantiate this.

minor comments (2)

[Abstract] Abstract: 'broad evaluations across multiple tasks' is stated without enumerating the tasks or evaluation protocols.
[Dataset] Dataset description: additional details on textual description generation, quality control, and exact geographic sampling would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments, which highlight important areas for strengthening the presentation of our results and the validation of the semantic-anchor mechanism. We address each major comment below and have prepared revisions to the manuscript accordingly.

read point-by-point responses

Referee: [Abstract] Abstract: the claim that 'experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy' is unsupported by any quantitative results, baselines, ablation studies, or implementation details in the provided text, preventing verification of the central performance claims.

Authors: We agree that the abstract should provide immediate quantitative support for its central claim to allow readers to assess the improvements without first consulting the full experimental section. The full manuscript (Section 4 and Tables 1-3) reports concrete gains, including a 12.4% increase in top-1 recall for cross-view matching and 8.7% for language-to-image retrieval after GeoLoc pre-training, along with comparisons to baselines such as CVGL and CLIP-based models. To directly address the concern, we will revise the abstract to incorporate these key quantitative results and a brief mention of the evaluation protocol. revision: yes
Referee: [Method] Method section on semantic-anchor mechanism: the assumption that textual descriptions reliably bridge multi-view features without loss of fine-grained spatial details or viewpoint biases is load-bearing for both novelty and claimed gains, yet natural language is lossy for metric relations and geometries needed to disambiguate locations; targeted ablations comparing localization error with and without the anchor step are required to substantiate this.

Authors: We acknowledge that the semantic-anchor mechanism is central to the approach and that natural language descriptions are inherently lossy with respect to precise metric geometry. The current manuscript includes ablation studies (Section 4.3) that isolate the contribution of the anchor by comparing full GeoBridge against a variant without textual bridging, showing improved cross-view alignment and reduced domain gap. However, these ablations focus primarily on retrieval metrics rather than explicit localization error distributions or viewpoint-bias analysis. We will add a targeted ablation in the revised manuscript that directly measures localization error (in meters) with and without the anchor step, including breakdowns by viewpoint and geographic region to quantify any loss of spatial precision. revision: yes

Circularity Check

0 steps flagged

No significant circularity; proposal is self-contained with new mechanism and dataset

full rationale

The paper introduces GeoBridge as a new multi-view foundation model using a semantic-anchor mechanism and releases the GeoLoc dataset with over 50,000 aligned pairs. The abstract frames the work as extending prior foundation models via new components and experimental validation on the constructed dataset. No equations, predictions, or first-principles derivations are described that reduce by construction to fitted inputs, self-citations, or renamed known results. The central claims rest on empirical evaluations and cross-modal transfer, which are externally falsifiable and independent of the model's own definitions. This is a standard model-proposal paper without load-bearing circular steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 1 invented entities

The central claim depends on the effectiveness of the semantic-anchor mechanism for feature bridging and on the assumption that the GeoLoc dataset provides representative geographic and semantic alignment across views and modalities.

invented entities (1)

semantic-anchor mechanism no independent evidence
purpose: To bridge multi-view image features through textual descriptions for robust localization
Introduced in the paper as the core novel component that enables bidirectional matching and language-to-image retrieval.

pith-pipeline@v0.9.0 · 5545 in / 1201 out tokens · 112414 ms · 2026-05-17T02:40:17.996261+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

57 extracted references · 57 canonical work pages · 1 internal anchor

[1]

Gnss high-precision augmentation for autonomous vehicles: Requirements, solution, and technical challenges.Remote Sensing, 15(6):1623, 2023

Liang Chen, Fu Zheng, Xiaopeng Gong, and Xinyuan Jiang. Gnss high-precision augmentation for autonomous vehicles: Requirements, solution, and technical challenges.Remote Sensing, 15(6):1623, 2023. 2

work page 2023
[2]

Mul- tilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization

Zhongwei Chen, Zhao-Xu Yang, and Hai-Jun Rong. Mul- tilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization. IEEE Transactions on Geoscience and Remote Sensing, 63: 1–15, 2025. 5, 6

work page 2025
[3]

Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching

Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, and Tat-Seng Chua. Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching. In European Conference on Computer Vision, pages 213–231. Springer, 2024. 2, 3

work page 2024
[4]

Ming Dai, Jianhong Hu, Jiedong Zhuang, and Enhui Zheng. A transformer-based feature segmentation and region align- ment method for uav-view geo-localization.IEEE Transac- tions on Circuits and Systems for Video Technology, 32(7): 4376–4389, 2021. 2

work page 2021
[5]

Vision-based uav self- positioning in low-altitude urban environments.IEEE Trans- actions on Image Processing, 33:493–508, 2023

Ming Dai, Enhui Zheng, Zhenhua Feng, Lei Qi, Jiedong Zhuang, and Wankou Yang. Vision-based uav self- positioning in low-altitude urban environments.IEEE Trans- actions on Image Processing, 33:493–508, 2023. 2, 3, 4

work page 2023
[6]

Sam- ple4geo: Hard negative sampling for cross-view geo- localisation

Fabian Deuser, Konrad Habel, and Norbert Oswald. Sam- ple4geo: Hard negative sampling for cross-view geo- localisation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16847–16856, 2023. 5, 6

work page 2023
[7]

Ccr: A counter- factual causal reasoning-based method for cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11630–11643, 2024

Haolin Du, Jingfei He, and Yuanqing Zhao. Ccr: A counter- factual causal reasoning-based method for cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11630–11643, 2024. 5, 6

work page 2024
[8]

Cross-view geo-localization: a survey

Abhilash Durgam, Sidike Paheding, Vikas Dhiman, and Vi- jay Devabhaktuni. Cross-view geo-localization: a survey. IEEE Access, 2024. 2

work page 2024
[9]

Design and implementation of intelligent eod system based on six-rotor uav.Drones, 5(4):146, 2021

Jiwei Fan, Ruitao Lu, Xiaogang Yang, Fan Gao, Qingge Li, and Jun Zeng. Design and implementation of intelligent eod system based on six-rotor uav.Drones, 5(4):146, 2021. 2

work page 2021
[10]

Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

work page
[11]

Uncertainty-aware vision-based metric cross-view geolocal- ization

Florian Fervers, Sebastian Bullinger, Christoph Bo- densteiner, Michael Arens, and Rainer Stiefelhagen. Uncertainty-aware vision-based metric cross-view geolocal- ization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21621– 21631, 2023. 2

work page 2023
[12]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jen- nifer Wortman Vaughan, Hanna Wallach, Hal Daum´e Iii, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021. 20

work page 2021
[13]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 6, 7

work page 2025
[14]

Vimgeo: Efficient cross-view geo-localization with vision mamba architecture

Jinglin Huang, Maoqiang Wu, Peichun Li, Wen Wu, and Rong Yu. Vimgeo: Efficient cross-view geo-localization with vision mamba architecture. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 1188–1196. International Joint Conferences on Artificial Intelligence Organization, 2025. Main Track. 6

work page 2025
[15]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024
[16]

Vilt: Vision- and-language transformer without convolution or region su- pervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region su- pervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 6, 8

work page 2021
[17]

Hao Li, Fabian Deuser, Wenping Yin, Xuanshu Luo, Paul Walther, Gengchen Mai, Wei Huang, and Martin Werner. Cross-view geolocalization and disaster mapping with street- view and vhr satellite imagery: A case study of hurricane ian.ISPRS Journal of Photogrammetry and Remote Sensing, 220:841–854, 2025. 2

work page 2025
[18]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 6, 8

work page 2022
[19]

Georea- soner: Geo-localization with reasoning in street views using a large vision-language model

Ling Li, Yu Ye, Bingchuan Jiang, and Wei Zeng. Georea- soner: Geo-localization with reasoning in street views using a large vision-language model. InForty-first International Conference on Machine Learning, 2024. 2

work page 2024
[20]

Geoformer: An effective transformer-based siamese network for uav geolocalization

Qingge Li, Xiaogang Yang, Jiwei Fan, Ruitao Lu, Bin Tang, Siyu Wang, and Shuang Su. Geoformer: An effective transformer-based siamese network for uav geolocalization. IEEE Journal of Selected Topics in Applied Earth Observa- tions and Remote Sensing, 17:9470–9491, 2024. 2

work page 2024
[21]

Lending orientation to neural networks for cross-view geo-localization

Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5624–5633, 2019. 3, 4

work page 2019
[22]

Segcn: A semantic-aware graph convolutional network for uav geo-localization.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:6055– 6066, 2024

Xiangzeng Liu, Ziyao Wang, Yue Wu, and Qiguang Miao. Segcn: A semantic-aware graph convolutional network for uav geo-localization.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:6055– 6066, 2024. 5

work page 2024
[23]

Direction-guided mul- tiscale feature fusion network for geo-localization.IEEE 9 Transactions on Geoscience and Remote Sensing, 62:1–13,

Hongxiang Lv, Hai Zhu, Runzhe Zhu, Fei Wu, Chunyuan Wang, Meiyu Cai, and Kaiyu Zhang. Direction-guided mul- tiscale feature fusion network for geo-localization.IEEE 9 Transactions on Geoscience and Remote Sensing, 62:1–13,

work page
[24]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 5, 6, 8

work page 2021
[25]

Supervised text-based ge- olocation using language models on an adaptive grid

Stephen Roller, Michael Speriosu, Sarat Rallapalli, Ben- jamin Wing, and Jason Baldridge. Supervised text-based ge- olocation using language models on an adaptive grid. InPro- ceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pages 1500–1510, 2012. 2

work page 2012
[26]

Orienternet: Visual localization in 2d public maps with neural matching

Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulo, Richard Newcombe, Peter Kontschieder, and Vasileios Balntas. Orienternet: Visual localization in 2d public maps with neural matching. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21632–2164...

work page 2023
[27]

Mccg: A convnext-based multiple- classifier method for cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1456–1468, 2024

Tianrui Shen, Yingmei Wei, Lai Kang, Shanshan Wan, and Yee-Hong Yang. Mccg: A convnext-based multiple- classifier method for cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1456–1468, 2024. 5, 6

work page 2024
[28]

Jian Sun, Junlang Huang, Xinyu Jiang, Yimin Zhou, and Chi-Man VONG. Cgsi: Context-guided and uav’s status in- formed multimodal framework for generalizable cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025. 5

work page 2025
[29]

Cross-view image matching for geo-localization in urban environments

Yicong Tian, Chen Chen, and Mubarak Shah. Cross-view image matching for geo-localization in urban environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3616, 2017. 3, 4

work page 2017
[30]

24/7 place recognition by view synthesis

Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817,

work page
[31]

Each part matters: Local patterns facilitate cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 32(2):867–879, 2021

Tingyu Wang, Zhedong Zheng, Chenggang Yan, Jiyong Zhang, Yaoqi Sun, Bolun Zheng, and Yi Yang. Each part matters: Local patterns facilitate cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 32(2):867–879, 2021. 2

work page 2021
[32]

Learning cross-view geo- localization embeddings via dynamic weighted decorrelation regularization.IEEE Transactions on Geoscience and Re- mote Sensing, 2024

Tingyu Wang, Zhedong Zheng, Zunjie Zhu, Yaoqi Sun, Chenggang Yan, and Yi Yang. Learning cross-view geo- localization embeddings via dynamic weighted decorrelation regularization.IEEE Transactions on Geoscience and Re- mote Sensing, 2024. 5

work page 2024
[33]

Fine-grained cross-view geo-localization using a correlation-aware homography estimator.Advances in Neu- ral Information Processing Systems, 36:5301–5319, 2023

Xiaolong Wang, Runsen Xu, Zhuofan Cui, Zeyu Wan, and Yu Zhang. Fine-grained cross-view geo-localization using a correlation-aware homography estimator.Advances in Neu- ral Information Processing Systems, 36:5301–5319, 2023. 6

work page 2023
[34]

Wide-area image geolocalization with aerial reference im- agery

Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference im- agery. InProceedings of the IEEE International Conference on Computer Vision, pages 3961–3969, 2015. 3, 4, 6

work page 2015
[35]

Camp: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning.IEEE Transactions on Geo- science and Remote Sensing, 2024

Qiong Wu, Yi Wan, Zhi Zheng, Yongjun Zhang, Guang- shuai Wang, and Zhenyang Zhao. Camp: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning.IEEE Transactions on Geo- science and Remote Sensing, 2024. 6

work page 2024
[36]

Uav-geoloc: A large- vocabulary dataset and geometry-transformed method for uav geo-localization.IEEE Robotics and Automation Let- ters, 2025

Rouwan Wu, Jiacheng Deng, Mingyu Mou, Xingyi He, Mao- jun Zhang, Yu Liu, and Shen Yan. Uav-geoloc: A large- vocabulary dataset and geometry-transformed method for uav geo-localization.IEEE Robotics and Automation Let- ters, 2025. 2

work page 2025
[37]

Enhancing cross-view geo-localization with domain alignment and scene consistency.IEEE Transactions on Circuits and Systems for Video Technology, 34(12):13271– 13281, 2024

Panwang Xia, Yi Wan, Zhi Zheng, Yongjun Zhang, and Jiwei Deng. Enhancing cross-view geo-localization with domain alignment and scene consistency.IEEE Transactions on Circuits and Systems for Video Technology, 34(12):13271– 13281, 2024. 5, 6

work page 2024
[38]

Cross-view geo-localization with panoramic street-view and vhr satellite imagery in decentral- ity settings.ISPRS Journal of Photogrammetry and Remote Sensing, 227:1–11, 2025

Panwang Xia, Lei Yu, Yi Wan, Qiong Wu, Peiqi Chen, Li- heng Zhong, Yongxiang Yao, Dong Wei, Xinyi Liu, Lixiang Ru, Yingying Zhang, Jiangwei Lao, Jingdong Chen, Ming Yang, and Yongjun Zhang. Cross-view geo-localization with panoramic street-view and vhr satellite imagery in decentral- ity settings.ISPRS Journal of Photogrammetry and Remote Sensing, 227:1–1...

work page 2025
[39]

Adapting fine-grained cross-view localization to areas with- out fine ground truth

Zimin Xia, Yujiao Shi, Hongdong Li, and Julian FP Kooij. Adapting fine-grained cross-view localization to areas with- out fine ground truth. InEuropean Conference on Computer Vision, pages 397–415. Springer, 2024. 2

work page 2024
[40]

Where am i? cross-view geo-localization with natural language descrip- tions.arXiv preprint arXiv:2412.17007, 2024

Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, and Weijia Li. Where am i? cross-view geo-localization with natural language descrip- tions.arXiv preprint arXiv:2412.17007, 2024. 2

work page arXiv 2024
[41]

Cross-view image geo- localization with panorama-bev co-retrieval network

Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, and Conghui He. Cross-view image geo- localization with panorama-bev co-retrieval network. In European Conference on Computer Vision, pages 74–90. Springer, 2024. 6

work page 2024
[42]

Cross-view image geo- localization with panorama-bev co-retrieval network

Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, and Conghui He. Cross-view image geo- localization with panorama-bev co-retrieval network. In European Conference on Computer Vision, pages 74–90. Springer, 2024. 2

work page 2024
[43]

Where am i? cross-view geo-localization with natural language descrip- tions

Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, and Weijia Li. Where am i? cross-view geo-localization with natural language descrip- tions. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 5890–5900, 2025. 2, 3, 6, 8

work page 2025
[44]

Aligning geometric spatial layout in cross-view geo-localization via feature re- combination

Qingwang Zhang and Yingying Zhu. Aligning geometric spatial layout in cross-view geo-localization via feature re- combination. InProceedings of the AAAI Conference on Ar- tificial Intelligence, pages 7251–7259, 2024. 6

work page 2024
[45]

University- 1652: A multi-view multi-source benchmark for drone- based geo-localization

Zhedong Zheng, Yunchao Wei, and Yi Yang. University- 1652: A multi-view multi-source benchmark for drone- based geo-localization. InProceedings of the 28th ACM international conference on Multimedia, pages 1395–1403,

work page
[46]

Uav’s status is worth considering: A fusion represen- 10 tations matching method for geo-localization.Sensors, 23 (2):720, 2023

Runzhe Zhu, Mingze Yang, Ling Yin, Fei Wu, and Yuncheng Yang. Uav’s status is worth considering: A fusion represen- 10 tations matching method for geo-localization.Sensors, 23 (2):720, 2023. 5

work page 2023
[47]

Sues-200: A multi-height multi- scene cross-view image benchmark across drone and satel- lite.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4825–4839, 2023

Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. Sues-200: A multi-height multi- scene cross-view image benchmark across drone and satel- lite.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4825–4839, 2023. 3, 4, 5, 6

work page 2023
[48]

Vigor: Cross- view image geo-localization beyond one-to-one retrieval

Sijie Zhu, Taojiannan Yang, and Chen Chen. Vigor: Cross- view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 3640–3649, 2021. 3, 4, 6

work page 2021
[49]

Transgeo: Trans- former is all you need for cross-view image geo-localization

Sijie Zhu, Mubarak Shah, and Chen Chen. Transgeo: Trans- former is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1162–1171, 2022. 6

work page 2022
[50]

Transgeo: Trans- former is all you need for cross-view image geo-localization

Sijie Zhu, Mubarak Shah, and Chen Chen. Transgeo: Trans- former is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1162–1171, 2022. 2

work page 2022
[51]

arXiv preprint arXiv:2302.01572 , year=

Yingying Zhu, Hongji Yang, Yuxin Lu, and Qiang Huang. Simple, effective and general: A new back- bone for cross-view image geo-localization.arXiv preprint arXiv:2302.01572, 2023. 5, 6 11 GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization Supplementary Material

work page arXiv 2023
[52]

The appendix is organized as follows: • Section 8

Overview This appendix supplements the proposed GeoBridge and our datasets GeoLoc with details excluded from the main paper due to space constraints. The appendix is organized as follows: • Section 8. Construction and preprocessing details of the GeoLoc dataset. • Section 9. Implementation details of instruction for- matting for constructing the descripti...

work page
[53]

Compared with the description in the main paper, this section provides more fine-grained operational details

GeoLoc Construction and Processing After acquiring the GeoLoc data, we performed systematic, multi-stage cleaning and quality control. Compared with the description in the main paper, this section provides more fine-grained operational details. It is worth noting that each step of the pipeline is manually monitored, and the overall process requires approx...

work page
[54]

Ant Street Inn

Instruction Details for the GeoLoc Dataset To construct high-quality cross-view semantic descriptions for GeoLoc, we design a unified instruction protocol that guides a large language model to generate consistent, viewpoint-agnostic textual annotations for each tri-view set (drone, satellite, and street-view images). The goal of these instructions is to e...

work page
[55]

In this setting, we generate a textual description from a single viewpoint and use it to retrieve images from the other viewpoints

Visualizations of Model Inference Results We visualize the cross-modal geo-location results. In this setting, we generate a textual description from a single viewpoint and use it to retrieve images from the other viewpoints. Fig. 15, 16, and 17 show the retrieval results for satellite images, drone images, and street-view images using descriptions derived...

work page
[56]

For what purpose was the dataset created?

Datasheets In this section, we document essential details about the proposed datasets and benchmarks following the CVPR Dataset and Benchmark guidelines and the template pro- vided by Gebruet al.[12]. 11.1. Motivation The questions in this section are primarily intended to en- courage dataset creators to clearly articulate their reasons for creating the d...

work page
[57]

Limitation and Potential Societal Impact In this section, we discuss the limitations and potential so- cietal impact of this work. 12.1. Potential Limitations Despite its scale and multi-view design, GeoLoc has sev- eral inherent limitations. First, the dataset is constrained by the availability of Google Street View and Google Maps Satellite imagery, whi...

work page

[1] [1]

Gnss high-precision augmentation for autonomous vehicles: Requirements, solution, and technical challenges.Remote Sensing, 15(6):1623, 2023

Liang Chen, Fu Zheng, Xiaopeng Gong, and Xinyuan Jiang. Gnss high-precision augmentation for autonomous vehicles: Requirements, solution, and technical challenges.Remote Sensing, 15(6):1623, 2023. 2

work page 2023

[2] [2]

Mul- tilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization

Zhongwei Chen, Zhao-Xu Yang, and Hai-Jun Rong. Mul- tilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization. IEEE Transactions on Geoscience and Remote Sensing, 63: 1–15, 2025. 5, 6

work page 2025

[3] [3]

Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching

Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, and Tat-Seng Chua. Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching. In European Conference on Computer Vision, pages 213–231. Springer, 2024. 2, 3

work page 2024

[4] [4]

Ming Dai, Jianhong Hu, Jiedong Zhuang, and Enhui Zheng. A transformer-based feature segmentation and region align- ment method for uav-view geo-localization.IEEE Transac- tions on Circuits and Systems for Video Technology, 32(7): 4376–4389, 2021. 2

work page 2021

[5] [5]

Vision-based uav self- positioning in low-altitude urban environments.IEEE Trans- actions on Image Processing, 33:493–508, 2023

Ming Dai, Enhui Zheng, Zhenhua Feng, Lei Qi, Jiedong Zhuang, and Wankou Yang. Vision-based uav self- positioning in low-altitude urban environments.IEEE Trans- actions on Image Processing, 33:493–508, 2023. 2, 3, 4

work page 2023

[6] [6]

Sam- ple4geo: Hard negative sampling for cross-view geo- localisation

Fabian Deuser, Konrad Habel, and Norbert Oswald. Sam- ple4geo: Hard negative sampling for cross-view geo- localisation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16847–16856, 2023. 5, 6

work page 2023

[7] [7]

Ccr: A counter- factual causal reasoning-based method for cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11630–11643, 2024

Haolin Du, Jingfei He, and Yuanqing Zhao. Ccr: A counter- factual causal reasoning-based method for cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11630–11643, 2024. 5, 6

work page 2024

[8] [8]

Cross-view geo-localization: a survey

Abhilash Durgam, Sidike Paheding, Vikas Dhiman, and Vi- jay Devabhaktuni. Cross-view geo-localization: a survey. IEEE Access, 2024. 2

work page 2024

[9] [9]

Design and implementation of intelligent eod system based on six-rotor uav.Drones, 5(4):146, 2021

Jiwei Fan, Ruitao Lu, Xiaogang Yang, Fan Gao, Qingge Li, and Jun Zeng. Design and implementation of intelligent eod system based on six-rotor uav.Drones, 5(4):146, 2021. 2

work page 2021

[10] [10]

Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,

work page

[11] [11]

Uncertainty-aware vision-based metric cross-view geolocal- ization

Florian Fervers, Sebastian Bullinger, Christoph Bo- densteiner, Michael Arens, and Rainer Stiefelhagen. Uncertainty-aware vision-based metric cross-view geolocal- ization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21621– 21631, 2023. 2

work page 2023

[12] [12]

Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021

Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jen- nifer Wortman Vaughan, Hanna Wallach, Hal Daum´e Iii, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021. 20

work page 2021

[13] [13]

Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025

Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 6, 7

work page 2025

[14] [14]

Vimgeo: Efficient cross-view geo-localization with vision mamba architecture

Jinglin Huang, Maoqiang Wu, Peichun Li, Wen Wu, and Rong Yu. Vimgeo: Efficient cross-view geo-localization with vision mamba architecture. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 1188–1196. International Joint Conferences on Artificial Intelligence Organization, 2025. Main Track. 6

work page 2025

[15] [15]

GPT-4o System Card

Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 5

work page internal anchor Pith review Pith/arXiv arXiv 2024

[16] [16]

Vilt: Vision- and-language transformer without convolution or region su- pervision

Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region su- pervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 6, 8

work page 2021

[17] [17]

Hao Li, Fabian Deuser, Wenping Yin, Xuanshu Luo, Paul Walther, Gengchen Mai, Wei Huang, and Martin Werner. Cross-view geolocalization and disaster mapping with street- view and vhr satellite imagery: A case study of hurricane ian.ISPRS Journal of Photogrammetry and Remote Sensing, 220:841–854, 2025. 2

work page 2025

[18] [18]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 6, 8

work page 2022

[19] [19]

Georea- soner: Geo-localization with reasoning in street views using a large vision-language model

Ling Li, Yu Ye, Bingchuan Jiang, and Wei Zeng. Georea- soner: Geo-localization with reasoning in street views using a large vision-language model. InForty-first International Conference on Machine Learning, 2024. 2

work page 2024

[20] [20]

Geoformer: An effective transformer-based siamese network for uav geolocalization

Qingge Li, Xiaogang Yang, Jiwei Fan, Ruitao Lu, Bin Tang, Siyu Wang, and Shuang Su. Geoformer: An effective transformer-based siamese network for uav geolocalization. IEEE Journal of Selected Topics in Applied Earth Observa- tions and Remote Sensing, 17:9470–9491, 2024. 2

work page 2024

[21] [21]

Lending orientation to neural networks for cross-view geo-localization

Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5624–5633, 2019. 3, 4

work page 2019

[22] [22]

Segcn: A semantic-aware graph convolutional network for uav geo-localization.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:6055– 6066, 2024

Xiangzeng Liu, Ziyao Wang, Yue Wu, and Qiguang Miao. Segcn: A semantic-aware graph convolutional network for uav geo-localization.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:6055– 6066, 2024. 5

work page 2024

[23] [23]

Direction-guided mul- tiscale feature fusion network for geo-localization.IEEE 9 Transactions on Geoscience and Remote Sensing, 62:1–13,

Hongxiang Lv, Hai Zhu, Runzhe Zhu, Fei Wu, Chunyuan Wang, Meiyu Cai, and Kaiyu Zhang. Direction-guided mul- tiscale feature fusion network for geo-localization.IEEE 9 Transactions on Geoscience and Remote Sensing, 62:1–13,

work page

[24] [24]

Learning transferable visual models from natural language supervi- sion

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 5, 6, 8

work page 2021

[25] [25]

Supervised text-based ge- olocation using language models on an adaptive grid

Stephen Roller, Michael Speriosu, Sarat Rallapalli, Ben- jamin Wing, and Jason Baldridge. Supervised text-based ge- olocation using language models on an adaptive grid. InPro- ceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pages 1500–1510, 2012. 2

work page 2012

[26] [26]

Orienternet: Visual localization in 2d public maps with neural matching

Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulo, Richard Newcombe, Peter Kontschieder, and Vasileios Balntas. Orienternet: Visual localization in 2d public maps with neural matching. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21632–2164...

work page 2023

[27] [27]

Mccg: A convnext-based multiple- classifier method for cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1456–1468, 2024

Tianrui Shen, Yingmei Wei, Lai Kang, Shanshan Wan, and Yee-Hong Yang. Mccg: A convnext-based multiple- classifier method for cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1456–1468, 2024. 5, 6

work page 2024

[28] [28]

Jian Sun, Junlang Huang, Xinyu Jiang, Yimin Zhou, and Chi-Man VONG. Cgsi: Context-guided and uav’s status in- formed multimodal framework for generalizable cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025. 5

work page 2025

[29] [29]

Cross-view image matching for geo-localization in urban environments

Yicong Tian, Chen Chen, and Mubarak Shah. Cross-view image matching for geo-localization in urban environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3616, 2017. 3, 4

work page 2017

[30] [30]

24/7 place recognition by view synthesis

Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817,

work page

[31] [31]

Each part matters: Local patterns facilitate cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 32(2):867–879, 2021

Tingyu Wang, Zhedong Zheng, Chenggang Yan, Jiyong Zhang, Yaoqi Sun, Bolun Zheng, and Yi Yang. Each part matters: Local patterns facilitate cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 32(2):867–879, 2021. 2

work page 2021

[32] [32]

Learning cross-view geo- localization embeddings via dynamic weighted decorrelation regularization.IEEE Transactions on Geoscience and Re- mote Sensing, 2024

Tingyu Wang, Zhedong Zheng, Zunjie Zhu, Yaoqi Sun, Chenggang Yan, and Yi Yang. Learning cross-view geo- localization embeddings via dynamic weighted decorrelation regularization.IEEE Transactions on Geoscience and Re- mote Sensing, 2024. 5

work page 2024

[33] [33]

Fine-grained cross-view geo-localization using a correlation-aware homography estimator.Advances in Neu- ral Information Processing Systems, 36:5301–5319, 2023

Xiaolong Wang, Runsen Xu, Zhuofan Cui, Zeyu Wan, and Yu Zhang. Fine-grained cross-view geo-localization using a correlation-aware homography estimator.Advances in Neu- ral Information Processing Systems, 36:5301–5319, 2023. 6

work page 2023

[34] [34]

Wide-area image geolocalization with aerial reference im- agery

Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference im- agery. InProceedings of the IEEE International Conference on Computer Vision, pages 3961–3969, 2015. 3, 4, 6

work page 2015

[35] [35]

Camp: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning.IEEE Transactions on Geo- science and Remote Sensing, 2024

Qiong Wu, Yi Wan, Zhi Zheng, Yongjun Zhang, Guang- shuai Wang, and Zhenyang Zhao. Camp: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning.IEEE Transactions on Geo- science and Remote Sensing, 2024. 6

work page 2024

[36] [36]

Uav-geoloc: A large- vocabulary dataset and geometry-transformed method for uav geo-localization.IEEE Robotics and Automation Let- ters, 2025

Rouwan Wu, Jiacheng Deng, Mingyu Mou, Xingyi He, Mao- jun Zhang, Yu Liu, and Shen Yan. Uav-geoloc: A large- vocabulary dataset and geometry-transformed method for uav geo-localization.IEEE Robotics and Automation Let- ters, 2025. 2

work page 2025

[37] [37]

Enhancing cross-view geo-localization with domain alignment and scene consistency.IEEE Transactions on Circuits and Systems for Video Technology, 34(12):13271– 13281, 2024

Panwang Xia, Yi Wan, Zhi Zheng, Yongjun Zhang, and Jiwei Deng. Enhancing cross-view geo-localization with domain alignment and scene consistency.IEEE Transactions on Circuits and Systems for Video Technology, 34(12):13271– 13281, 2024. 5, 6

work page 2024

[38] [38]

Cross-view geo-localization with panoramic street-view and vhr satellite imagery in decentral- ity settings.ISPRS Journal of Photogrammetry and Remote Sensing, 227:1–11, 2025

Panwang Xia, Lei Yu, Yi Wan, Qiong Wu, Peiqi Chen, Li- heng Zhong, Yongxiang Yao, Dong Wei, Xinyi Liu, Lixiang Ru, Yingying Zhang, Jiangwei Lao, Jingdong Chen, Ming Yang, and Yongjun Zhang. Cross-view geo-localization with panoramic street-view and vhr satellite imagery in decentral- ity settings.ISPRS Journal of Photogrammetry and Remote Sensing, 227:1–1...

work page 2025

[39] [39]

Adapting fine-grained cross-view localization to areas with- out fine ground truth

Zimin Xia, Yujiao Shi, Hongdong Li, and Julian FP Kooij. Adapting fine-grained cross-view localization to areas with- out fine ground truth. InEuropean Conference on Computer Vision, pages 397–415. Springer, 2024. 2

work page 2024

[40] [40]

Where am i? cross-view geo-localization with natural language descrip- tions.arXiv preprint arXiv:2412.17007, 2024

Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, and Weijia Li. Where am i? cross-view geo-localization with natural language descrip- tions.arXiv preprint arXiv:2412.17007, 2024. 2

work page arXiv 2024

[41] [41]

Cross-view image geo- localization with panorama-bev co-retrieval network

Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, and Conghui He. Cross-view image geo- localization with panorama-bev co-retrieval network. In European Conference on Computer Vision, pages 74–90. Springer, 2024. 6

work page 2024

[42] [42]

Cross-view image geo- localization with panorama-bev co-retrieval network

Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, and Conghui He. Cross-view image geo- localization with panorama-bev co-retrieval network. In European Conference on Computer Vision, pages 74–90. Springer, 2024. 2

work page 2024

[43] [43]

Where am i? cross-view geo-localization with natural language descrip- tions

Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, and Weijia Li. Where am i? cross-view geo-localization with natural language descrip- tions. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 5890–5900, 2025. 2, 3, 6, 8

work page 2025

[44] [44]

Aligning geometric spatial layout in cross-view geo-localization via feature re- combination

Qingwang Zhang and Yingying Zhu. Aligning geometric spatial layout in cross-view geo-localization via feature re- combination. InProceedings of the AAAI Conference on Ar- tificial Intelligence, pages 7251–7259, 2024. 6

work page 2024

[45] [45]

University- 1652: A multi-view multi-source benchmark for drone- based geo-localization

Zhedong Zheng, Yunchao Wei, and Yi Yang. University- 1652: A multi-view multi-source benchmark for drone- based geo-localization. InProceedings of the 28th ACM international conference on Multimedia, pages 1395–1403,

work page

[46] [46]

Uav’s status is worth considering: A fusion represen- 10 tations matching method for geo-localization.Sensors, 23 (2):720, 2023

Runzhe Zhu, Mingze Yang, Ling Yin, Fei Wu, and Yuncheng Yang. Uav’s status is worth considering: A fusion represen- 10 tations matching method for geo-localization.Sensors, 23 (2):720, 2023. 5

work page 2023

[47] [47]

Sues-200: A multi-height multi- scene cross-view image benchmark across drone and satel- lite.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4825–4839, 2023

Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. Sues-200: A multi-height multi- scene cross-view image benchmark across drone and satel- lite.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4825–4839, 2023. 3, 4, 5, 6

work page 2023

[48] [48]

Vigor: Cross- view image geo-localization beyond one-to-one retrieval

Sijie Zhu, Taojiannan Yang, and Chen Chen. Vigor: Cross- view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 3640–3649, 2021. 3, 4, 6

work page 2021

[49] [49]

Transgeo: Trans- former is all you need for cross-view image geo-localization

Sijie Zhu, Mubarak Shah, and Chen Chen. Transgeo: Trans- former is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1162–1171, 2022. 6

work page 2022

[50] [50]

Transgeo: Trans- former is all you need for cross-view image geo-localization

Sijie Zhu, Mubarak Shah, and Chen Chen. Transgeo: Trans- former is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1162–1171, 2022. 2

work page 2022

[51] [51]

arXiv preprint arXiv:2302.01572 , year=

Yingying Zhu, Hongji Yang, Yuxin Lu, and Qiang Huang. Simple, effective and general: A new back- bone for cross-view image geo-localization.arXiv preprint arXiv:2302.01572, 2023. 5, 6 11 GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization Supplementary Material

work page arXiv 2023

[52] [52]

The appendix is organized as follows: • Section 8

Overview This appendix supplements the proposed GeoBridge and our datasets GeoLoc with details excluded from the main paper due to space constraints. The appendix is organized as follows: • Section 8. Construction and preprocessing details of the GeoLoc dataset. • Section 9. Implementation details of instruction for- matting for constructing the descripti...

work page

[53] [53]

Compared with the description in the main paper, this section provides more fine-grained operational details

GeoLoc Construction and Processing After acquiring the GeoLoc data, we performed systematic, multi-stage cleaning and quality control. Compared with the description in the main paper, this section provides more fine-grained operational details. It is worth noting that each step of the pipeline is manually monitored, and the overall process requires approx...

work page

[54] [54]

Ant Street Inn

Instruction Details for the GeoLoc Dataset To construct high-quality cross-view semantic descriptions for GeoLoc, we design a unified instruction protocol that guides a large language model to generate consistent, viewpoint-agnostic textual annotations for each tri-view set (drone, satellite, and street-view images). The goal of these instructions is to e...

work page

[55] [55]

In this setting, we generate a textual description from a single viewpoint and use it to retrieve images from the other viewpoints

Visualizations of Model Inference Results We visualize the cross-modal geo-location results. In this setting, we generate a textual description from a single viewpoint and use it to retrieve images from the other viewpoints. Fig. 15, 16, and 17 show the retrieval results for satellite images, drone images, and street-view images using descriptions derived...

work page

[56] [56]

For what purpose was the dataset created?

Datasheets In this section, we document essential details about the proposed datasets and benchmarks following the CVPR Dataset and Benchmark guidelines and the template pro- vided by Gebruet al.[12]. 11.1. Motivation The questions in this section are primarily intended to en- courage dataset creators to clearly articulate their reasons for creating the d...

work page

[57] [57]

Limitation and Potential Societal Impact In this section, we discuss the limitations and potential so- cietal impact of this work. 12.1. Potential Limitations Despite its scale and multi-view design, GeoLoc has sev- eral inherent limitations. First, the dataset is constrained by the availability of Google Street View and Google Maps Satellite imagery, whi...

work page