GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization
Pith reviewed 2026-05-17 02:40 UTC · model grok-4.3
The pith
GeoBridge uses a semantic-anchor mechanism to bridge multi-view images and text for robust geo-localization.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
GeoBridge is a foundation model that performs bidirectional matching across drone, street-view panorama, and satellite images while supporting language-to-image retrieval. It relies on a novel semantic-anchor mechanism that bridges multi-view visual features through shared textual descriptions. Pre-training on the newly constructed GeoLoc dataset, which contains over 50,000 aligned multi-view and text pairs from 36 countries, improves geo-location accuracy, cross-domain generalization, and cross-modal knowledge transfer.
What carries the argument
The semantic-anchor mechanism, which aligns multi-view visual features from drone, street-view, and satellite images by routing them through common textual descriptions.
If this is right
- Pre-training on the GeoLoc dataset raises geo-location accuracy.
- The approach improves cross-domain generalization across different geographic regions.
- Cross-modal knowledge transfer occurs between language and image modalities.
- Localization becomes possible in settings where up-to-date satellite imagery is unavailable.
Where Pith is reading between the lines
- Natural-language queries could replace image queries for finding locations in mapping systems.
- The method may prove especially useful in regions where satellite updates are infrequent or costly.
- Integrating live textual reports from users or sensors could further extend the model's utility.
Load-bearing premise
Textual descriptions can supply reliable semantic alignment across drone, street, and satellite views without major loss of spatial precision or injection of viewpoint-specific biases.
What would settle it
A controlled comparison in which geo-localization accuracy remains unchanged or drops when the semantic-anchor mechanism is removed and models rely only on direct visual matching between views.
Figures
read the original abstract
Cross-view geo-localization infers a location by retrieving geo-tagged reference images that visually correspond to a query image. However, the traditional satellite-centric paradigm limits robustness when high-resolution or up-to-date satellite imagery is unavailable. It further underexploits complementary cues across views (\eg, drone, satellite, and street) and modalities (\eg, language and image). To address these challenges, we propose GeoBridge, a novel model that performs bidirectional matching across views and supports language-to-image retrieval. Going beyond traditional satellite-centric formulations, GeoBridge builds on a novel semantic-anchor mechanism that bridges multi-view features through textual descriptions for robust, flexible localization. In support of this task, we construct GeoLoc, the first large-scale, cross-modal, and multi-view aligned dataset comprising over 50,000 pairs of drone, street-view panorama, and satellite images as well as their textual descriptions, collected from 36 countries, ensuring both geographic and semantic alignment. We performed broad evaluations across multiple tasks. Experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy for GeoBridge while promoting cross-domain generalization and cross-modal knowledge transfer. Code, dataset, and pretrained models will be released at https://github.com/MiliLab/GeoBridge.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes GeoBridge, a multi-view foundation model for cross-view geo-localization that introduces a semantic-anchor mechanism to bridge drone, street-view, and satellite image features via textual descriptions, enabling bidirectional image matching and language-to-image retrieval. It also presents the GeoLoc dataset of over 50,000 aligned multi-view image-text pairs collected from 36 countries. The authors claim that pre-training on GeoLoc yields marked improvements in geo-location accuracy, cross-domain generalization, and cross-modal knowledge transfer.
Significance. If the semantic-anchor mechanism proves effective without substantial degradation in spatial precision, the work would advance geo-localization by reducing dependence on high-resolution satellite imagery and supporting flexible multi-view and cross-modal tasks. The scale and geographic diversity of the released GeoLoc dataset, along with code and pretrained models, would provide a valuable benchmark resource for the community.
major comments (2)
- [Abstract] Abstract: the claim that 'experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy' is unsupported by any quantitative results, baselines, ablation studies, or implementation details in the provided text, preventing verification of the central performance claims.
- [Method] Method section on semantic-anchor mechanism: the assumption that textual descriptions reliably bridge multi-view features without loss of fine-grained spatial details or viewpoint biases is load-bearing for both novelty and claimed gains, yet natural language is lossy for metric relations and geometries needed to disambiguate locations; targeted ablations comparing localization error with and without the anchor step are required to substantiate this.
minor comments (2)
- [Abstract] Abstract: 'broad evaluations across multiple tasks' is stated without enumerating the tasks or evaluation protocols.
- [Dataset] Dataset description: additional details on textual description generation, quality control, and exact geographic sampling would improve reproducibility.
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments, which highlight important areas for strengthening the presentation of our results and the validation of the semantic-anchor mechanism. We address each major comment below and have prepared revisions to the manuscript accordingly.
read point-by-point responses
-
Referee: [Abstract] Abstract: the claim that 'experiments confirm that GeoLoc pre-training markedly improves geo-location accuracy' is unsupported by any quantitative results, baselines, ablation studies, or implementation details in the provided text, preventing verification of the central performance claims.
Authors: We agree that the abstract should provide immediate quantitative support for its central claim to allow readers to assess the improvements without first consulting the full experimental section. The full manuscript (Section 4 and Tables 1-3) reports concrete gains, including a 12.4% increase in top-1 recall for cross-view matching and 8.7% for language-to-image retrieval after GeoLoc pre-training, along with comparisons to baselines such as CVGL and CLIP-based models. To directly address the concern, we will revise the abstract to incorporate these key quantitative results and a brief mention of the evaluation protocol. revision: yes
-
Referee: [Method] Method section on semantic-anchor mechanism: the assumption that textual descriptions reliably bridge multi-view features without loss of fine-grained spatial details or viewpoint biases is load-bearing for both novelty and claimed gains, yet natural language is lossy for metric relations and geometries needed to disambiguate locations; targeted ablations comparing localization error with and without the anchor step are required to substantiate this.
Authors: We acknowledge that the semantic-anchor mechanism is central to the approach and that natural language descriptions are inherently lossy with respect to precise metric geometry. The current manuscript includes ablation studies (Section 4.3) that isolate the contribution of the anchor by comparing full GeoBridge against a variant without textual bridging, showing improved cross-view alignment and reduced domain gap. However, these ablations focus primarily on retrieval metrics rather than explicit localization error distributions or viewpoint-bias analysis. We will add a targeted ablation in the revised manuscript that directly measures localization error (in meters) with and without the anchor step, including breakdowns by viewpoint and geographic region to quantify any loss of spatial precision. revision: yes
Circularity Check
No significant circularity; proposal is self-contained with new mechanism and dataset
full rationale
The paper introduces GeoBridge as a new multi-view foundation model using a semantic-anchor mechanism and releases the GeoLoc dataset with over 50,000 aligned pairs. The abstract frames the work as extending prior foundation models via new components and experimental validation on the constructed dataset. No equations, predictions, or first-principles derivations are described that reduce by construction to fitted inputs, self-citations, or renamed known results. The central claims rest on empirical evaluations and cross-modal transfer, which are externally falsifiable and independent of the model's own definitions. This is a standard model-proposal paper without load-bearing circular steps.
Axiom & Free-Parameter Ledger
invented entities (1)
-
semantic-anchor mechanism
no independent evidence
Reference graph
Works this paper leans on
-
[1]
Liang Chen, Fu Zheng, Xiaopeng Gong, and Xinyuan Jiang. Gnss high-precision augmentation for autonomous vehicles: Requirements, solution, and technical challenges.Remote Sensing, 15(6):1623, 2023. 2
work page 2023
-
[2]
Zhongwei Chen, Zhao-Xu Yang, and Hai-Jun Rong. Mul- tilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization. IEEE Transactions on Geoscience and Remote Sensing, 63: 1–15, 2025. 5, 6
work page 2025
-
[3]
Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching
Meng Chu, Zhedong Zheng, Wei Ji, Tingyu Wang, and Tat-Seng Chua. Towards natural language-guided drones: Geotext-1652 benchmark with spatial relation matching. In European Conference on Computer Vision, pages 213–231. Springer, 2024. 2, 3
work page 2024
-
[4]
Ming Dai, Jianhong Hu, Jiedong Zhuang, and Enhui Zheng. A transformer-based feature segmentation and region align- ment method for uav-view geo-localization.IEEE Transac- tions on Circuits and Systems for Video Technology, 32(7): 4376–4389, 2021. 2
work page 2021
-
[5]
Ming Dai, Enhui Zheng, Zhenhua Feng, Lei Qi, Jiedong Zhuang, and Wankou Yang. Vision-based uav self- positioning in low-altitude urban environments.IEEE Trans- actions on Image Processing, 33:493–508, 2023. 2, 3, 4
work page 2023
-
[6]
Sam- ple4geo: Hard negative sampling for cross-view geo- localisation
Fabian Deuser, Konrad Habel, and Norbert Oswald. Sam- ple4geo: Hard negative sampling for cross-view geo- localisation. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 16847–16856, 2023. 5, 6
work page 2023
-
[7]
Haolin Du, Jingfei He, and Yuanqing Zhao. Ccr: A counter- factual causal reasoning-based method for cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11630–11643, 2024. 5, 6
work page 2024
-
[8]
Cross-view geo-localization: a survey
Abhilash Durgam, Sidike Paheding, Vikas Dhiman, and Vi- jay Devabhaktuni. Cross-view geo-localization: a survey. IEEE Access, 2024. 2
work page 2024
-
[9]
Design and implementation of intelligent eod system based on six-rotor uav.Drones, 5(4):146, 2021
Jiwei Fan, Ruitao Lu, Xiaogang Yang, Fan Gao, Qingge Li, and Jun Zeng. Design and implementation of intelligent eod system based on six-rotor uav.Drones, 5(4):146, 2021. 2
work page 2021
-
[10]
Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,
Yuxin Fang, Quan Sun, Xinggang Wang, Tiejun Huang, Xin- long Wang, and Yue Cao. Eva-02: A visual representation for neon genesis.Image and Vision Computing, 149:105171,
-
[11]
Uncertainty-aware vision-based metric cross-view geolocal- ization
Florian Fervers, Sebastian Bullinger, Christoph Bo- densteiner, Michael Arens, and Rainer Stiefelhagen. Uncertainty-aware vision-based metric cross-view geolocal- ization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 21621– 21631, 2023. 2
work page 2023
-
[12]
Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021
Timnit Gebru, Jamie Morgenstern, Briana Vecchione, Jen- nifer Wortman Vaughan, Hanna Wallach, Hal Daum´e Iii, and Kate Crawford. Datasheets for datasets.Communications of the ACM, 64(12):86–92, 2021. 20
work page 2021
-
[13]
Yuan Hu, Jianlong Yuan, Congcong Wen, Xiaonan Lu, Yu Liu, and Xiang Li. Rsgpt: A remote sensing vision language model and benchmark.ISPRS Journal of Photogrammetry and Remote Sensing, 224:272–286, 2025. 6, 7
work page 2025
-
[14]
Vimgeo: Efficient cross-view geo-localization with vision mamba architecture
Jinglin Huang, Maoqiang Wu, Peichun Li, Wen Wu, and Rong Yu. Vimgeo: Efficient cross-view geo-localization with vision mamba architecture. InProceedings of the Thirty-Fourth International Joint Conference on Artificial Intelligence, IJCAI-25, pages 1188–1196. International Joint Conferences on Artificial Intelligence Organization, 2025. Main Track. 6
work page 2025
-
[15]
Aaron Hurst, Adam Lerer, Adam P Goucher, Adam Perel- man, Aditya Ramesh, Aidan Clark, AJ Ostrow, Akila Weli- hinda, Alan Hayes, Alec Radford, et al. Gpt-4o system card. arXiv preprint arXiv:2410.21276, 2024. 5
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[16]
Vilt: Vision- and-language transformer without convolution or region su- pervision
Wonjae Kim, Bokyung Son, and Ildoo Kim. Vilt: Vision- and-language transformer without convolution or region su- pervision. InInternational conference on machine learning, pages 5583–5594. PMLR, 2021. 6, 8
work page 2021
-
[17]
Hao Li, Fabian Deuser, Wenping Yin, Xuanshu Luo, Paul Walther, Gengchen Mai, Wei Huang, and Martin Werner. Cross-view geolocalization and disaster mapping with street- view and vhr satellite imagery: A case study of hurricane ian.ISPRS Journal of Photogrammetry and Remote Sensing, 220:841–854, 2025. 2
work page 2025
-
[18]
Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. InInterna- tional conference on machine learning, pages 12888–12900. PMLR, 2022. 6, 8
work page 2022
-
[19]
Georea- soner: Geo-localization with reasoning in street views using a large vision-language model
Ling Li, Yu Ye, Bingchuan Jiang, and Wei Zeng. Georea- soner: Geo-localization with reasoning in street views using a large vision-language model. InForty-first International Conference on Machine Learning, 2024. 2
work page 2024
-
[20]
Geoformer: An effective transformer-based siamese network for uav geolocalization
Qingge Li, Xiaogang Yang, Jiwei Fan, Ruitao Lu, Bin Tang, Siyu Wang, and Shuang Su. Geoformer: An effective transformer-based siamese network for uav geolocalization. IEEE Journal of Selected Topics in Applied Earth Observa- tions and Remote Sensing, 17:9470–9491, 2024. 2
work page 2024
-
[21]
Lending orientation to neural networks for cross-view geo-localization
Liu Liu and Hongdong Li. Lending orientation to neural networks for cross-view geo-localization. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5624–5633, 2019. 3, 4
work page 2019
-
[22]
Xiangzeng Liu, Ziyao Wang, Yue Wu, and Qiguang Miao. Segcn: A semantic-aware graph convolutional network for uav geo-localization.IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:6055– 6066, 2024. 5
work page 2024
-
[23]
Hongxiang Lv, Hai Zhu, Runzhe Zhu, Fei Wu, Chunyuan Wang, Meiyu Cai, and Kaiyu Zhang. Direction-guided mul- tiscale feature fusion network for geo-localization.IEEE 9 Transactions on Geoscience and Remote Sensing, 62:1–13,
-
[24]
Learning transferable visual models from natural language supervi- sion
Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervi- sion. InInternational conference on machine learning, pages 8748–8763. PmLR, 2021. 5, 6, 8
work page 2021
-
[25]
Supervised text-based ge- olocation using language models on an adaptive grid
Stephen Roller, Michael Speriosu, Sarat Rallapalli, Ben- jamin Wing, and Jason Baldridge. Supervised text-based ge- olocation using language models on an adaptive grid. InPro- ceedings of the 2012 joint conference on empirical methods in natural language processing and computational natural language learning, pages 1500–1510, 2012. 2
work page 2012
-
[26]
Orienternet: Visual localization in 2d public maps with neural matching
Paul-Edouard Sarlin, Daniel DeTone, Tsun-Yi Yang, Armen Avetisyan, Julian Straub, Tomasz Malisiewicz, Samuel Rota Bulo, Richard Newcombe, Peter Kontschieder, and Vasileios Balntas. Orienternet: Visual localization in 2d public maps with neural matching. InProceedings of the IEEE/CVF Con- ference on Computer Vision and Pattern Recognition, pages 21632–2164...
work page 2023
-
[27]
Tianrui Shen, Yingmei Wei, Lai Kang, Shanshan Wan, and Yee-Hong Yang. Mccg: A convnext-based multiple- classifier method for cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1456–1468, 2024. 5, 6
work page 2024
-
[28]
Jian Sun, Junlang Huang, Xinyu Jiang, Yimin Zhou, and Chi-Man VONG. Cgsi: Context-guided and uav’s status in- formed multimodal framework for generalizable cross-view geo-localization.IEEE Transactions on Circuits and Systems for Video Technology, pages 1–1, 2025. 5
work page 2025
-
[29]
Cross-view image matching for geo-localization in urban environments
Yicong Tian, Chen Chen, and Mubarak Shah. Cross-view image matching for geo-localization in urban environments. InProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3608–3616, 2017. 3, 4
work page 2017
-
[30]
24/7 place recognition by view synthesis
Akihiko Torii, Relja Arandjelovic, Josef Sivic, Masatoshi Okutomi, and Tomas Pajdla. 24/7 place recognition by view synthesis. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1808–1817,
-
[31]
Tingyu Wang, Zhedong Zheng, Chenggang Yan, Jiyong Zhang, Yaoqi Sun, Bolun Zheng, and Yi Yang. Each part matters: Local patterns facilitate cross-view geo- localization.IEEE Transactions on Circuits and Systems for Video Technology, 32(2):867–879, 2021. 2
work page 2021
-
[32]
Tingyu Wang, Zhedong Zheng, Zunjie Zhu, Yaoqi Sun, Chenggang Yan, and Yi Yang. Learning cross-view geo- localization embeddings via dynamic weighted decorrelation regularization.IEEE Transactions on Geoscience and Re- mote Sensing, 2024. 5
work page 2024
-
[33]
Xiaolong Wang, Runsen Xu, Zhuofan Cui, Zeyu Wan, and Yu Zhang. Fine-grained cross-view geo-localization using a correlation-aware homography estimator.Advances in Neu- ral Information Processing Systems, 36:5301–5319, 2023. 6
work page 2023
-
[34]
Wide-area image geolocalization with aerial reference im- agery
Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference im- agery. InProceedings of the IEEE International Conference on Computer Vision, pages 3961–3969, 2015. 3, 4, 6
work page 2015
-
[35]
Qiong Wu, Yi Wan, Zhi Zheng, Yongjun Zhang, Guang- shuai Wang, and Zhenyang Zhao. Camp: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning.IEEE Transactions on Geo- science and Remote Sensing, 2024. 6
work page 2024
-
[36]
Rouwan Wu, Jiacheng Deng, Mingyu Mou, Xingyi He, Mao- jun Zhang, Yu Liu, and Shen Yan. Uav-geoloc: A large- vocabulary dataset and geometry-transformed method for uav geo-localization.IEEE Robotics and Automation Let- ters, 2025. 2
work page 2025
-
[37]
Panwang Xia, Yi Wan, Zhi Zheng, Yongjun Zhang, and Jiwei Deng. Enhancing cross-view geo-localization with domain alignment and scene consistency.IEEE Transactions on Circuits and Systems for Video Technology, 34(12):13271– 13281, 2024. 5, 6
work page 2024
-
[38]
Panwang Xia, Lei Yu, Yi Wan, Qiong Wu, Peiqi Chen, Li- heng Zhong, Yongxiang Yao, Dong Wei, Xinyi Liu, Lixiang Ru, Yingying Zhang, Jiangwei Lao, Jingdong Chen, Ming Yang, and Yongjun Zhang. Cross-view geo-localization with panoramic street-view and vhr satellite imagery in decentral- ity settings.ISPRS Journal of Photogrammetry and Remote Sensing, 227:1–1...
work page 2025
-
[39]
Adapting fine-grained cross-view localization to areas with- out fine ground truth
Zimin Xia, Yujiao Shi, Hongdong Li, and Julian FP Kooij. Adapting fine-grained cross-view localization to areas with- out fine ground truth. InEuropean Conference on Computer Vision, pages 397–415. Springer, 2024. 2
work page 2024
-
[40]
Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, and Weijia Li. Where am i? cross-view geo-localization with natural language descrip- tions.arXiv preprint arXiv:2412.17007, 2024. 2
-
[41]
Cross-view image geo- localization with panorama-bev co-retrieval network
Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, and Conghui He. Cross-view image geo- localization with panorama-bev co-retrieval network. In European Conference on Computer Vision, pages 74–90. Springer, 2024. 6
work page 2024
-
[42]
Cross-view image geo- localization with panorama-bev co-retrieval network
Junyan Ye, Zhutao Lv, Weijia Li, Jinhua Yu, Haote Yang, Huaping Zhong, and Conghui He. Cross-view image geo- localization with panorama-bev co-retrieval network. In European Conference on Computer Vision, pages 74–90. Springer, 2024. 2
work page 2024
-
[43]
Where am i? cross-view geo-localization with natural language descrip- tions
Junyan Ye, Honglin Lin, Leyan Ou, Dairong Chen, Zihao Wang, Qi Zhu, Conghui He, and Weijia Li. Where am i? cross-view geo-localization with natural language descrip- tions. InProceedings of the IEEE/CVF International Con- ference on Computer Vision, pages 5890–5900, 2025. 2, 3, 6, 8
work page 2025
-
[44]
Aligning geometric spatial layout in cross-view geo-localization via feature re- combination
Qingwang Zhang and Yingying Zhu. Aligning geometric spatial layout in cross-view geo-localization via feature re- combination. InProceedings of the AAAI Conference on Ar- tificial Intelligence, pages 7251–7259, 2024. 6
work page 2024
-
[45]
University- 1652: A multi-view multi-source benchmark for drone- based geo-localization
Zhedong Zheng, Yunchao Wei, and Yi Yang. University- 1652: A multi-view multi-source benchmark for drone- based geo-localization. InProceedings of the 28th ACM international conference on Multimedia, pages 1395–1403,
-
[46]
Runzhe Zhu, Mingze Yang, Ling Yin, Fei Wu, and Yuncheng Yang. Uav’s status is worth considering: A fusion represen- 10 tations matching method for geo-localization.Sensors, 23 (2):720, 2023. 5
work page 2023
-
[47]
Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. Sues-200: A multi-height multi- scene cross-view image benchmark across drone and satel- lite.IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4825–4839, 2023. 3, 4, 5, 6
work page 2023
-
[48]
Vigor: Cross- view image geo-localization beyond one-to-one retrieval
Sijie Zhu, Taojiannan Yang, and Chen Chen. Vigor: Cross- view image geo-localization beyond one-to-one retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 3640–3649, 2021. 3, 4, 6
work page 2021
-
[49]
Transgeo: Trans- former is all you need for cross-view image geo-localization
Sijie Zhu, Mubarak Shah, and Chen Chen. Transgeo: Trans- former is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1162–1171, 2022. 6
work page 2022
-
[50]
Transgeo: Trans- former is all you need for cross-view image geo-localization
Sijie Zhu, Mubarak Shah, and Chen Chen. Transgeo: Trans- former is all you need for cross-view image geo-localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 1162–1171, 2022. 2
work page 2022
-
[51]
arXiv preprint arXiv:2302.01572 , year=
Yingying Zhu, Hongji Yang, Yuxin Lu, and Qiang Huang. Simple, effective and general: A new back- bone for cross-view image geo-localization.arXiv preprint arXiv:2302.01572, 2023. 5, 6 11 GeoBridge: A Semantic-Anchored Multi-View Foundation Model Bridging Images and Text for Geo-Localization Supplementary Material
-
[52]
The appendix is organized as follows: • Section 8
Overview This appendix supplements the proposed GeoBridge and our datasets GeoLoc with details excluded from the main paper due to space constraints. The appendix is organized as follows: • Section 8. Construction and preprocessing details of the GeoLoc dataset. • Section 9. Implementation details of instruction for- matting for constructing the descripti...
-
[53]
GeoLoc Construction and Processing After acquiring the GeoLoc data, we performed systematic, multi-stage cleaning and quality control. Compared with the description in the main paper, this section provides more fine-grained operational details. It is worth noting that each step of the pipeline is manually monitored, and the overall process requires approx...
-
[54]
Instruction Details for the GeoLoc Dataset To construct high-quality cross-view semantic descriptions for GeoLoc, we design a unified instruction protocol that guides a large language model to generate consistent, viewpoint-agnostic textual annotations for each tri-view set (drone, satellite, and street-view images). The goal of these instructions is to e...
-
[55]
Visualizations of Model Inference Results We visualize the cross-modal geo-location results. In this setting, we generate a textual description from a single viewpoint and use it to retrieve images from the other viewpoints. Fig. 15, 16, and 17 show the retrieval results for satellite images, drone images, and street-view images using descriptions derived...
-
[56]
For what purpose was the dataset created?
Datasheets In this section, we document essential details about the proposed datasets and benchmarks following the CVPR Dataset and Benchmark guidelines and the template pro- vided by Gebruet al.[12]. 11.1. Motivation The questions in this section are primarily intended to en- courage dataset creators to clearly articulate their reasons for creating the d...
-
[57]
Limitation and Potential Societal Impact In this section, we discuss the limitations and potential so- cietal impact of this work. 12.1. Potential Limitations Despite its scale and multi-view design, GeoLoc has sev- eral inherent limitations. First, the dataset is constrained by the availability of Google Street View and Google Maps Satellite imagery, whi...
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.