arxiv: 2605.10345 · v1 · submitted 2026-05-11 · 💻 cs.CV

BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

Wei Wang , Dou Quan , Ning Huyan , Shuang Wang , Yi Li , Pei He , Licheng Jiao This is my paper

Pith reviewed 2026-05-12 04:53 UTC · model grok-4.3

classification 💻 cs.CV

keywords Cross-View Geo-LocalizationVision Foundation ModelsParameter-Efficient AdaptationDilated ConvolutionsFrequency Domain ProcessingImage RetrievalDrone-Satellite Matching

0 comments

The pith

Adapting a vision foundation model with multi-granularity and frequency modules bridges geometric gaps between drone and satellite views for better geo-localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes BGG as a parameter-efficient adaptation of vision foundation models like DINOv3 to handle the geometric differences that make cross-view geo-localization difficult. It adds an MFEA module that applies multi-level dilated convolutions to improve scale and viewpoint robustness in features, and an FASA module that processes patch tokens in the frequency domain to strengthen local structural details before fusing them with the CLS token. The approach aims to extract consistent, robust representations across views while keeping training costs low. If effective, this would allow pre-trained models to perform accurate image retrieval for geolocation tasks on standard benchmarks without full retraining. A reader would care because it targets practical improvements in retrieval accuracy for applications like mapping and navigation from mismatched image sources.

Core claim

BGG adapts a vision foundation model through a Multi-granularity Feature Enhancement Adapter (MFEA) that employs multi-level dilated convolutions to enhance scale adaptability and viewpoint robustness, thereby bridging the cross-view geometric gap with small training costs, combined with a Frequency-Aware Structural Aggregation (FASA) module that modulates patch tokens in the frequency domain and performs adaptive aggregation to enhance local structural features. The enhanced local features are fused with the CLS token to enable more accurate cross-view geo-localization, yielding state-of-the-art performance on the University-1652 and SUES-200 datasets.

What carries the argument

BGG adaptation framework consisting of the MFEA module (multi-level dilated convolutions for multi-granularity feature enhancement) and the FASA module (frequency-domain modulation and adaptive aggregation of patch tokens to supplement the CLS token).

If this is right

The adapted model captures robust and consistent features from cross-view images by leveraging VFM general representations.
Fusing frequency-enhanced local features with the CLS token improves image retrieval precision for geolocation.
The framework achieves state-of-the-art localization results on University-1652 and SUES-200 while using low training costs.
The generalization capabilities of the VFM are utilized to handle viewpoint and scale variations without full retraining.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The frequency-domain handling in FASA could extend to other retrieval tasks where structural consistency across domains matters, such as medical or remote-sensing image matching.
Parameter-efficient adapters of this form might reduce data requirements for new cross-view problems by building on existing foundation model weights.
If the modules prove stable across datasets, the method could support real-time updates to geo-localization systems with minimal compute.

Load-bearing premise

The MFEA and FASA modules will reliably bridge geometric gaps across arbitrary cross-view image pairs without introducing new artifacts or requiring dataset-specific hyperparameter tuning that raises training costs.

What would settle it

On a held-out cross-view dataset with larger scale or viewpoint shifts than University-1652, if BGG shows no accuracy gain over a plain VFM baseline while its training cost rises above the claimed low level, the bridging claim would not hold.

Figures

Figures reproduced from arXiv: 2605.10345 by Dou Quan, Licheng Jiao, Ning Huyan, Pei He, Shuang Wang, Wei Wang, Yi Li.

**Figure 2.** Figure 2: Overview of the proposed efficient adaptation framework of a vision foundation model for CVGL between drone and satellite imagery (BGG). BGG [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Overview of the proposed frequency-aware structural aggregation [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Visualization of cross-view feature maps on University-1652 for the pre-trained foundation model (DINOv3) under the Frozen and full-parameter [PITH_FULL_IMAGE:figures/full_fig_p011_4.png] view at source ↗

**Figure 5.** Figure 5: Top-5 retrieval results of the proposed method on the University-1652 dataset. [PITH_FULL_IMAGE:figures/full_fig_p012_5.png] view at source ↗

**Figure 6.** Figure 6: Top-5 retrieval results of the proposed method on the SUES-200 dataset. [PITH_FULL_IMAGE:figures/full_fig_p012_6.png] view at source ↗

read the original abstract

Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

BGG adapts a frozen DINOv3 with MFEA and FASA modules to improve cross-view geo-localization on two datasets but the experimental support remains thin until full results are checked.

read the letter

Hi, the main thing to know is that BGG adds two adapter-style modules to a frozen DINOv3 backbone for cross-view geo-localization. MFEA uses multi-level dilated convolutions to boost scale and viewpoint robustness, while FASA modulates patch tokens in the frequency domain and aggregates them to add spatial detail missing from the CLS token. The paper reports SOTA results on University-1652 and SUES-200 with low training cost by leveraging the VFM's existing representations rather than retraining everything. This is a reasonable, targeted extension of adapter techniques to the geometric-shift problem in CVGL, and the frequency-aware step for local features is a logical response to the limitations of global tokens. The approach stays practical and keeps parameter counts down, which fits the applied remote-sensing use case. The soft spots are mostly around evidence. The abstract asserts clear gains but supplies no numbers, ablations, baselines, or error bars, so it is hard to tell how much the new modules actually drive the improvement versus other choices. Experiments stay on just two datasets with no reported test of whether the same adapter weights and frequency settings transfer to other cross-view pairs without retuning. That leaves the generalization and low-cost claims open to the concern that hidden per-dataset adjustments or frequency artifacts could appear outside these sets. The citation pattern is standard and builds on prior VFM and adapter work without obvious gaps. This is for applied CV researchers who need efficient domain adaptation for UAV or satellite localization tasks. A reader looking for plug-in recipes on existing backbones will find the design relevant. It has enough concrete modules and empirical claims to deserve a serious referee who can examine the full tables, code, and any transfer tests.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes BGG, a parameter-efficient adaptation framework for cross-view geo-localization (CVGL) that adapts a frozen vision foundation model (DINOv3). It introduces the Multi-granularity Feature Enhancement Adapter (MFEA) using multi-level dilated convolutions to improve scale adaptability and viewpoint robustness, and the Frequency-Aware Structural Aggregation (FASA) module that modulates patch tokens in the frequency domain with adaptive aggregation to enhance local structural features. The enhanced local features are fused with the [CLS] token for image retrieval. Experiments on University-1652 and SUES-200 datasets are reported to achieve state-of-the-art performance with low training costs.

Significance. If the empirical gains hold under scrutiny and the MFEA/FASA modules generalize without dataset-specific retuning or frequency artifacts, the work would provide a practical demonstration of leveraging VFMs for geometric gap bridging in CVGL, potentially enabling more efficient adaptation with reduced compute while maintaining or improving retrieval accuracy.

major comments (2)

[§3.2] §3.2 (FASA module description): The frequency-domain modulation and adaptive aggregation of patch tokens is asserted to enhance local structure without introducing misalignments or artifacts under extreme viewpoint/scale shifts, but no analysis, visualizations, or ablation on frequency parameter sensitivity is referenced to confirm this; this is load-bearing for the central claim that FASA reliably bridges geometric gaps.
[§4] §4 (Experiments): The SOTA claims and 'low training costs' plus 'generalization capabilities' assertions rest on results from only University-1652 and SUES-200; no cross-dataset transfer experiments (e.g., training on one and testing on another without retuning MFEA/FASA hyperparameters) or checks for degradation on other cross-view pairs are described, weakening support for the generalization claim.

minor comments (2)

[Abstract] Abstract and §1: The phrasing 'significantly improving the CVGL performance' and 'significant advantages' is repeated without immediate quantitative anchors; consider adding a brief reference to the reported metrics (e.g., recall@1 gains) for clarity.
[§3] Notation in §3: The description of MFEA's multi-level dilated convolutions and FASA's frequency modulation would benefit from an explicit equation or diagram label for the aggregation step to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript and providing valuable feedback. We appreciate the referee's recognition of the potential of our BGG framework. We address the major comments point-by-point below, proposing revisions where necessary to strengthen the paper.

read point-by-point responses

Referee: [§3.2] §3.2 (FASA module description): The frequency-domain modulation and adaptive aggregation of patch tokens is asserted to enhance local structure without introducing misalignments or artifacts under extreme viewpoint/scale shifts, but no analysis, visualizations, or ablation on frequency parameter sensitivity is referenced to confirm this; this is load-bearing for the central claim that FASA reliably bridges geometric gaps.

Authors: We thank the referee for highlighting this important aspect. While the empirical results on the benchmarks demonstrate the effectiveness of FASA in improving retrieval accuracy without apparent degradation from artifacts, we agree that explicit analysis would better support the claim. In the revised version, we will add: (1) visualizations showing the frequency spectra and reconstructed spatial features pre- and post-FASA to illustrate artifact-free enhancement; (2) an ablation study varying the frequency modulation parameters (e.g., low/high frequency emphasis) and reporting performance under controlled extreme scale and viewpoint variations. These additions will confirm that FASA bridges geometric gaps reliably. revision: yes
Referee: [§4] §4 (Experiments): The SOTA claims and 'low training costs' plus 'generalization capabilities' assertions rest on results from only University-1652 and SUES-200; no cross-dataset transfer experiments (e.g., training on one and testing on another without retuning MFEA/FASA hyperparameters) or checks for degradation on other cross-view pairs are described, weakening support for the generalization claim.

Authors: We acknowledge that dedicated cross-dataset transfer experiments would provide more direct evidence for the generalization capabilities. Although University-1652 and SUES-200 represent distinct environments (one university campus with drone/satellite, the other suburban with varying altitudes), and our method achieves SOTA on both using identical hyperparameters, we will include in the revision: training on University-1652 and evaluating zero-shot on SUES-200 (and the reverse) without any retuning of MFEA or FASA. This will quantify any degradation and further validate the VFM's generalization in bridging geometric gaps across different cross-view pairs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical module design with independent experimental validation

full rationale

The paper presents a parameter-efficient adaptation framework (BGG) consisting of MFEA (multi-level dilated convolutions for scale/viewpoint robustness) and FASA (frequency-domain modulation and adaptive aggregation of patch tokens). No equations, derivations, or 'predictions' are defined that reduce by construction to fitted inputs or self-referential definitions. Central claims rest on empirical improvements reported on University-1652 and SUES-200 datasets rather than any self-citation chain or uniqueness theorem imported from the authors' prior work. The modules are described as novel designs leveraging a frozen VFM backbone (DINOv3), with no load-bearing step that renames a known result or smuggles an ansatz via citation. This is a standard honest non-finding for an applied CV adaptation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that VFM representations remain useful after lightweight adaptation and that frequency-domain modulation preserves localization-relevant structure.

pith-pipeline@v0.9.0 · 5581 in / 1155 out tokens · 54204 ms · 2026-05-12T04:53:15.681382+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MFEA employs multi-level dilated convolutions... FASA modulates patch tokens in the frequency domain and performs adaptive aggregation
IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

BGG... parameter-efficient adaptation framework... low training costs

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 1 internal anchor

[1]

Localizing and orienting street views using over- head imagery,

N. N. V o and J. Hays, “Localizing and orienting street views using over- head imagery,” inEuropean conference on computer vision. Springer, 2016, pp. 494–509

work page 2016
[2]

Cvm-net: Cross-view matching network for image-based ground-to-aerial geo-localization,

S. Hu, M. Feng, R. M. Nguyen, and G. H. Lee, “Cvm-net: Cross-view matching network for image-based ground-to-aerial geo-localization,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2018, pp. 7258–7267

work page 2018
[3]

Mccg: A convnext- based multiple-classifier method for cross-view geo-localization,

T. Shen, Y . Wei, L. Kang, S. Wan, and Y .-H. Yang, “Mccg: A convnext- based multiple-classifier method for cross-view geo-localization,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 3, pp. 1456–1468, 2023

work page 2023
[4]

Locating target re- gions for image retrieval in an unsupervised manner,

B.-J. Zhang, G.-H. Liu, Z.-Y . Li, and S.-X. Song, “Locating target re- gions for image retrieval in an unsupervised manner,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 3, pp. 4664– 4676, 2025

work page 2025
[5]

University-1652: A multi-view multi- source benchmark for drone-based geo-localization,

Z. Zheng, Y . Wei, and Y . Yang, “University-1652: A multi-view multi- source benchmark for drone-based geo-localization,” inProceedings of the 28th ACM international conference on Multimedia, 2020, pp. 1395– 1403

work page 2020
[6]

Deductive reinforcement learning for visual autonomous urban driving navigation,

C. Huang, R. Zhang, M. Ouyang, P. Wei, J. Lin, J. Su, and L. Lin, “Deductive reinforcement learning for visual autonomous urban driving navigation,”IEEE Transactions on Neural Networks and Learning Systems, vol. 32, no. 12, pp. 5379–5391, 2021

work page 2021
[7]

Cross-view image matching for geo-localization in urban environments,

Y . Tian, C. Chen, and M. Shah, “Cross-view image matching for geo-localization in urban environments,” inProceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2017, pp. 3608–3616

work page 2017
[8]

Cross-view geo-localization with layer-to- layer transformer,

H. Yang, X. Lu, and Y . Zhu, “Cross-view geo-localization with layer-to- layer transformer,”Advances in Neural Information Processing Systems, vol. 34, pp. 29 009–29 020, 2021

work page 2021
[9]

Transgeo: Transformer is all you need for cross-view image geo-localization,

S. Zhu, M. Shah, and C. Chen, “Transgeo: Transformer is all you need for cross-view image geo-localization,” inProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2022, pp. 1162–1171

work page 2022
[10]

Simple, effective and general: A new back- bone for cross-view image geo-localization.arXiv preprint arXiv:2302.01572, 2023

Y . Zhu, H. Yang, Y . Lu, and Q. Huang, “Simple, effective and general: A new backbone for cross-view image geo-localization,”arXiv preprint arXiv:2302.01572, 2023

work page arXiv 2023
[11]

Netvlad: Cnn architecture for weakly supervised place recognition,

R. Arandjelovic, P. Gronat, A. Torii, T. Pajdla, and J. Sivic, “Netvlad: Cnn architecture for weakly supervised place recognition,” inProceed- ings of the IEEE conference on computer vision and pattern recognition, 2016, pp. 5297–5307

work page 2016
[12]

Fine-tuning cnn image retrieval with no human annotation,

F. Radenovi ´c, G. Tolias, and O. Chum, “Fine-tuning cnn image retrieval with no human annotation,”IEEE transactions on pattern analysis and machine intelligence, vol. 41, no. 7, pp. 1655–1668, 2018

work page 2018
[13]

Sample4geo: Hard negative sam- pling for cross-view geo-localisation,

F. Deuser, K. Habel, and N. Oswald, “Sample4geo: Hard negative sam- pling for cross-view geo-localisation,” inProceedings of the IEEE/CVF International Conference on Computer Vision, 2023, pp. 16 847–16 856

work page 2023
[14]

A practical cross-view image matching method between uav and satellite for uav-based geo- localization,

L. Ding, J. Zhou, L. Meng, and Z. Long, “A practical cross-view image matching method between uav and satellite for uav-based geo- localization,”Remote Sensing, vol. 13, no. 1, p. 47, 2020

work page 2020
[15]

Sdpl: Shifting-dense partition learning for uav-view geo-localization,

Q. Chen, T. Wang, Z. Yang, H. Li, R. Lu, Y . Sun, B. Zheng, and C. Yan, “Sdpl: Shifting-dense partition learning for uav-view geo-localization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 34, no. 11, pp. 11 810–11 824, 2024

work page 2024
[16]

Game4loc: A uav geo-localization benchmark from game data,

Y . Ji, B. He, Z. Tan, and L. Wu, “Game4loc: A uav geo-localization benchmark from game data,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 39, no. 4, 2025, pp. 3913–3921

work page 2025
[17]

Uav-satellite view syn- thesis for cross-view geo-localization,

X. Tian, J. Shao, D. Ouyang, and H. T. Shen, “Uav-satellite view syn- thesis for cross-view geo-localization,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4804–4815, 2021

work page 2021
[18]

Spatial-aware feature aggregation for image based cross-view geo-localization,

Y . Shi, L. Liu, X. Yu, and H. Li, “Spatial-aware feature aggregation for image based cross-view geo-localization,”Advances in Neural Informa- tion Processing Systems, vol. 32, 2019

work page 2019
[19]

F3-net: Multiview scene matching for drone-based geo-localization,

B. Sun, G. Liu, and Y . Yuan, “F3-net: Multiview scene matching for drone-based geo-localization,”IEEE Transactions on Geoscience and Remote Sensing, vol. 61, pp. 1–11, 2023

work page 2023
[20]

Enhancing cross-view geo-localization with domain alignment and scene consistency,

P. Xia, Y . Wan, Z. Zheng, Y . Zhang, and J. Deng, “Enhancing cross-view geo-localization with domain alignment and scene consistency,”IEEE Transactions on Circuits and Systems for Video Technology, 2024

work page 2024
[21]

Each part matters: Local patterns facilitate cross-view geo-localization,

T. Wang, Z. Zheng, C. Yan, J. Zhang, Y . Sun, B. Zheng, and Y . Yang, “Each part matters: Local patterns facilitate cross-view geo-localization,” IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 2, pp. 867–879, 2021. 14

work page 2021
[22]

Direction-guided multiscale feature fusion network for geo- localization,

H. Lv, H. Zhu, R. Zhu, F. Wu, C. Wang, M. Cai, and K. Zhang, “Direction-guided multiscale feature fusion network for geo- localization,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–13, 2024

work page 2024
[23]

Joint representation learning and keypoint detection for cross-view geo-localization,

J. Lin, Z. Zheng, Z. Zhong, Z. Luo, S. Li, Y . Yang, and N. Sebe, “Joint representation learning and keypoint detection for cross-view geo-localization,”IEEE Transactions on Image Processing, vol. 31, pp. 3780–3792, 2022

work page 2022
[24]

DINOv3

O. Sim ´eoni, H. V . V o, M. Seitzer, F. Baldassarre, M. Oquab, C. Jose, V . Khalidov, M. Szafraniec, S. Yi, M. Ramamonjisoaet al., “Dinov3,” arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

Parameter-efficient transfer learning for nlp,

N. Houlsby, A. Giurgiu, S. Jastrzebski, B. Morrone, Q. De Laroussilhe, A. Gesmundo, M. Attariyan, and S. Gelly, “Parameter-efficient transfer learning for nlp,” inInternational conference on machine learning. PMLR, 2019, pp. 2790–2799

work page 2019
[26]

Enhancing domain generalization in medical image segmentation with global and local prompts,

C. Zhao and X. Li, “Enhancing domain generalization in medical image segmentation with global and local prompts,”IEEE Transactions on Neural Networks and Learning Systems, vol. 36, no. 11, pp. 19 718– 19 732, 2025

work page 2025
[27]

Catastrophic forgetting in connectionist networks,

R. M. French, “Catastrophic forgetting in connectionist networks,” Trends in cognitive sciences, vol. 3, no. 4, pp. 128–135, 1999

work page 1999
[28]

Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite,

R. Zhu, L. Yin, M. Yang, F. Wu, Y . Yang, and W. Hu, “Sues-200: A multi-height multi-scene cross-view image benchmark across drone and satellite,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 33, no. 9, pp. 4825–4839, 2023

work page 2023
[29]

Wide-area image geolo- calization with aerial reference imagery,

S. Workman, R. Souvenir, and N. Jacobs, “Wide-area image geolo- calization with aerial reference imagery,” inProceedings of the IEEE International Conference on Computer Vision, 2015, pp. 3961–3969

work page 2015
[30]

Lending orientation to neural networks for cross- view geo-localization,

L. Liu and H. Li, “Lending orientation to neural networks for cross- view geo-localization,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2019, pp. 5624–5633

work page 2019
[31]

Vigor: Cross-view image geo-localization beyond one-to-one retrieval,

S. Zhu, T. Yang, and C. Chen, “Vigor: Cross-view image geo-localization beyond one-to-one retrieval,” inProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition, 2021, pp. 3640–3649

work page 2021
[32]

Coming down to earth: Satellite-to-street view synthesis for geo-localization,

A. Toker, Q. Zhou, M. Maximov, and L. Leal-Taix ´e, “Coming down to earth: Satellite-to-street view synthesis for geo-localization,” inPro- ceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2021, pp. 6488–6497

work page 2021
[33]

Optimal feature transport for cross-view image geo-localization,

Y . Shi, X. Yu, L. Liu, T. Zhang, and H. Li, “Optimal feature transport for cross-view image geo-localization,” inProceedings of the AAAI Conference on Artificial Intelligence, vol. 34, no. 07, 2020, pp. 11 990– 11 997

work page 2020
[34]

Cross-view geo- localization via learning disentangled geometric layout correspondence,

X. Zhang, X. Li, W. Sultani, Y . Zhou, and S. Wshah, “Cross-view geo- localization via learning disentangled geometric layout correspondence,” inProceedings of the AAAI conference on artificial intelligence, vol. 37, no. 3, 2023, pp. 3480–3488

work page 2023
[35]

Geodtr+: To- ward generic cross-view geolocalization via geometric disentanglement,

X. Zhang, X. Li, W. Sultani, C. Chen, and S. Wshah, “Geodtr+: To- ward generic cross-view geolocalization via geometric disentanglement,” IEEE Transactions on Pattern Analysis and Machine Intelligence, 2024

work page 2024
[36]

A transformer-based fea- ture segmentation and region alignment method for uav-view geo- localization,

M. Dai, J. Hu, J. Zhuang, and E. Zheng, “A transformer-based fea- ture segmentation and region alignment method for uav-view geo- localization,”IEEE Transactions on Circuits and Systems for Video Technology, vol. 32, no. 7, pp. 4376–4389, 2021

work page 2021
[37]

Transfg: A cross-view geo-localization of satellite and uavs imagery pipeline using transformer- based feature aggregation and gradient guidance,

H. Zhao, K. Ren, T. Yue, C. Zhang, and S. Yuan, “Transfg: A cross-view geo-localization of satellite and uavs imagery pipeline using transformer- based feature aggregation and gradient guidance,”IEEE Transactions on Geoscience and Remote Sensing, vol. 62, pp. 1–12, 2024

work page 2024
[38]

Multi-level embedding and alignment network with consistency and invariance learning for cross- view geo-localization,

Z. Chen, Z.-X. Yang, and H.-J. Rong, “Multi-level embedding and alignment network with consistency and invariance learning for cross- view geo-localization,”IEEE Transactions on Geoscience and Remote Sensing, 2025

work page 2025
[39]

Mfaf: An eva02-based multi-scale frequency attention fusion method for cross-view geo-localization,

Y . Liu, T. Liu, and Y . GU, “Mfaf: An eva02-based multi-scale frequency attention fusion method for cross-view geo-localization,”arXiv preprint arXiv:2509.12673, 2025

work page arXiv 2025
[40]

Beyond spatial domain: Multi-view geo-localization with frequency-based positive-incentive information screening,

B. Sun, M. Li, B. Sun, G. Liu, C. Bi, W. Wang, X. Feng, G. Zhang, and B. Hu, “Beyond spatial domain: Multi-view geo-localization with frequency-based positive-incentive information screening,”Remote Sens- ing, vol. 18, no. 1, p. 88, 2025

work page 2025
[41]

Masked au- toencoders are scalable vision learners,

K. He, X. Chen, S. Xie, Y . Li, P. Doll ´ar, and R. Girshick, “Masked au- toencoders are scalable vision learners,” inProceedings of the IEEE/CVF conference on computer vision and pattern recognition, 2022, pp. 16 000–16 009

work page 2022
[42]

Learning transferable visual models from natural language supervision,

A. Radford, J. W. Kim, C. Hallacy, A. Ramesh, G. Goh, S. Agarwal, G. Sastry, A. Askell, P. Mishkin, J. Clarket al., “Learning transferable visual models from natural language supervision,” inInternational conference on machine learning. PmLR, 2021, pp. 8748–8763

work page 2021
[43]

Lora: Low-rank adaptation of large language models

E. J. Hu, Y . Shen, P. Wallis, Z. Allen-Zhu, Y . Li, S. Wang, L. Wang, W. Chenet al., “Lora: Low-rank adaptation of large language models.” ICLR, vol. 1, no. 2, p. 3, 2022

work page 2022
[44]

Adaptformer: Adapting vision transformers for scalable visual recogni- tion,

S. Chen, C. Ge, Z. Tong, J. Wang, Y . Song, J. Wang, and P. Luo, “Adaptformer: Adapting vision transformers for scalable visual recogni- tion,”Advances in Neural Information Processing Systems, vol. 35, pp. 16 664–16 678, 2022

work page 2022
[45]

Mv-adapter: Multi-view consistent image generation made easy,

Z. Huang, Y .-C. Guo, H. Wang, R. Yi, L. Ma, Y .-P. Cao, and L. Sheng, “Mv-adapter: Multi-view consistent image generation made easy,” in Proceedings of the IEEE/CVF International Conference on Computer Vision, 2025, pp. 16 377–16 387

work page 2025
[46]

Learning cross-view visual geo-localization without ground truth,

H. Li, C. Xu, W. Yang, H. Yu, and G.-S. Xia, “Learning cross-view visual geo-localization without ground truth,”IEEE Transactions on Geoscience and Remote Sensing, 2024

work page 2024
[47]

Elp-adapters: Parameter efficient adapter tuning for various speech processing tasks,

N. Inoue, S. Otake, T. Hirose, M. Ohi, and R. Kawakami, “Elp-adapters: Parameter efficient adapter tuning for various speech processing tasks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, 2024

work page 2024
[48]

Vmt-adapter: Parameter- efficient transfer learning for multi-task dense scene understanding,

Y . Xin, J. Du, Q. Wang, Z. Lin, and K. Yan, “Vmt-adapter: Parameter- efficient transfer learning for multi-task dense scene understanding,” in Proceedings of the AAAI conference on artificial intelligence, vol. 38, no. 14, 2024, pp. 16 085–16 093

work page 2024
[49]

Convolutional bypasses are better vision transformer adapters,

S. Jie and Z.-H. Deng, “Convolutional bypasses are better vision transformer adapters,”arXiv preprint arXiv:2207.07039, 2022

work page arXiv 2022
[50]

Robust cross-view geo-localization via content-viewpoint disentangle- ment,

K. Li, D. Wang, X. Wang, Z. Wu, Y . Zhang, Y . Wang, and Q. Wang, “Robust cross-view geo-localization via content-viewpoint disentangle- ment,”arXiv preprint arXiv:2505.11822, 2025

work page arXiv 2025
[51]

Multiple- environment self-adaptive network for aerial-view geo-localization,

T. Wang, Z. Zheng, Y . Sun, C. Yan, Y . Yang, and T.-S. Chua, “Multiple- environment self-adaptive network for aerial-view geo-localization,” Pattern Recognition, vol. 152, p. 110363, 2024

work page 2024
[52]

Multibranch joint representation learning based on information fusion strategy for cross-view geo-localization,

F. Ge, Y . Zhang, Y . Liu, G. Wang, S. Coleman, D. Kerr, and L. Wang, “Multibranch joint representation learning based on information fusion strategy for cross-view geo-localization,”IEEE Transactions on Geo- science and Remote Sensing, vol. 62, pp. 1–16, 2024

work page 2024
[53]

Multilevel feedback joint representation learning network based on adaptive area elimination for cross-view geo-localization,

F. Ge, Y . Zhang, L. Wang, W. Liu, Y . Liu, S. Coleman, and D. Kerr, “Multilevel feedback joint representation learning network based on adaptive area elimination for cross-view geo-localization,”IEEE trans- actions on geoscience and remote sensing, vol. 62, pp. 1–15, 2024

work page 2024
[54]

Ccr: A counterfactual causal reasoning- based method for cross-view geo-localization,

H. Du, J. He, and Y . Zhao, “Ccr: A counterfactual causal reasoning- based method for cross-view geo-localization,”IEEE Transactions on Circuits and Systems for Video Technology, 2024

work page 2024