pith. machine review for the scientific record.

arxiv: 2605.10345 · v1 · submitted 2026-05-11 · 💻 cs.CV

BGG: Bridging the Geometric Gap between Cross-View images by Vision Foundation Model Adaptation for Geo-Localization

Pith reviewed 2026-05-12 04:53 UTC · model grok-4.3

classification 💻 cs.CV
keywords Cross-View Geo-Localization · Vision Foundation Models · Parameter-Efficient Adaptation · Dilated Convolutions · Frequency Domain Processing · Image Retrieval · Drone-Satellite Matching

The pith

Adapting a vision foundation model with multi-granularity and frequency modules bridges geometric gaps between drone and satellite views for better geo-localization.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes BGG as a parameter-efficient adaptation of vision foundation models like DINOv3 to handle the geometric differences that make cross-view geo-localization difficult. It adds an MFEA module that applies multi-level dilated convolutions to improve scale and viewpoint robustness in features, and an FASA module that processes patch tokens in the frequency domain to strengthen local structural details before fusing them with the CLS token. The approach aims to extract consistent, robust representations across views while keeping training costs low. If effective, this would allow pre-trained models to perform accurate image retrieval for geolocation tasks on standard benchmarks without full retraining. A reader would care because it targets practical improvements in retrieval accuracy for applications like mapping and navigation from mismatched image sources.

Core claim

BGG adapts a vision foundation model with two modules. A Multi-granularity Feature Enhancement Adapter (MFEA) employs multi-level dilated convolutions to enhance scale adaptability and viewpoint robustness, bridging the cross-view geometric gap at low training cost. A Frequency-Aware Structural Aggregation (FASA) module modulates patch tokens in the frequency domain and aggregates them adaptively to enhance local structural features. The enhanced local features are fused with the CLS token to enable more accurate cross-view geo-localization, yielding state-of-the-art performance on the University-1652 and SUES-200 datasets.
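
To make the idea of multi-level dilated convolutions concrete, here is a minimal NumPy sketch. It is not the paper's MFEA: the 3x3 depthwise kernels, the dilation rates (1, 2, 3), the averaging across branches, the residual connection, and all function names are assumptions chosen for illustration, since this summary does not specify them.

```python
import numpy as np


def dilated_depthwise_conv(x, kernel, dilation):
    """Depthwise 3x3 convolution with zero padding that keeps the spatial size.

    x:      (C, H, W) feature map
    kernel: (C, 3, 3) one filter per channel
    """
    c, h, w = x.shape
    pad = dilation  # a 3x3 kernel with dilation d spans 2*d + 1 pixels
    xp = np.pad(x, ((0, 0), (pad, pad), (pad, pad)))
    out = np.zeros_like(x)
    for i in range(3):
        for j in range(3):
            # Shifted view of the padded input for kernel tap (i, j).
            patch = xp[:, i * dilation:i * dilation + h,
                       j * dilation:j * dilation + w]
            out += kernel[:, i, j][:, None, None] * patch
    return out


def multi_granularity_adapter(x, kernels, dilations=(1, 2, 3)):
    """Average several dilated branches and add the result back to the input."""
    branches = [dilated_depthwise_conv(x, k, d)
                for k, d in zip(kernels, dilations)]
    return x + np.mean(branches, axis=0)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((8, 14, 14))   # hypothetical (C, H, W) patch grid
kernels = [0.1 * rng.standard_normal((8, 3, 3)) for _ in range(3)]
enhanced = multi_granularity_adapter(tokens, kernels)
receptive_fields = [2 * d + 1 for d in (1, 2, 3)]  # pixels per side, per branch
```

Each branch uses the same number of weights but covers a wider neighbourhood as the dilation grows, which is the usual motivation for using dilation to handle scale variation.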

What carries the argument

BGG adaptation framework consisting of the MFEA module (multi-level dilated convolutions for multi-granularity feature enhancement) and the FASA module (frequency-domain modulation and adaptive aggregation of patch tokens to supplement the CLS token).
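
The FASA pipeline described here (frequency-domain modulation, adaptive aggregation, fusion with the CLS token) can be sketched generically as follows. This is an illustration under stated assumptions, not the authors' implementation: the per-frequency gain, the softmax-weighted pooling, the fusion by concatenation, and every name and size in the code are hypothetical.

```python
import numpy as np


def frequency_modulate(patch_grid, gain):
    """Scale the 2D spectrum of each channel by a per-frequency gain.

    patch_grid: (H, W, C) patch tokens arranged on their spatial grid
    gain:       (H, W) real, non-negative multipliers (learned in practice)
    """
    spectrum = np.fft.fft2(patch_grid, axes=(0, 1))
    modulated = spectrum * gain[:, :, None]
    # A symmetric real gain gives a real result; .real drops any
    # imaginary part when the gain is not symmetric.
    return np.fft.ifft2(modulated, axes=(0, 1)).real


def adaptive_aggregate(patch_grid, score_vector):
    """Softmax-weighted pooling of patch tokens into one local descriptor."""
    tokens = patch_grid.reshape(-1, patch_grid.shape[-1])   # (N, C)
    logits = tokens @ score_vector                           # (N,)
    weights = np.exp(logits - logits.max())
    weights /= weights.sum()
    return weights @ tokens                                  # (C,)


def fuse_with_cls(cls_token, local_feature):
    """Concatenate global and local features, then L2-normalise for retrieval."""
    fused = np.concatenate([cls_token, local_feature])
    return fused / np.linalg.norm(fused)


rng = np.random.default_rng(0)
h, w, c = 14, 14, 16                 # hypothetical grid and channel sizes
patches = rng.standard_normal((h, w, c))
cls_token = rng.standard_normal(c)

# Hypothetical gain that boosts higher spatial frequencies (edges, structure).
fy = np.fft.fftfreq(h)[:, None]
fx = np.fft.fftfreq(w)[None, :]
gain = 1.0 + np.sqrt(fy ** 2 + fx ** 2)

local = adaptive_aggregate(frequency_modulate(patches, gain),
                           rng.standard_normal(c))
descriptor = fuse_with_cls(cls_token, local)
```

With a gain of all ones the modulation step is the identity, which makes it easy to check whether any change in retrieval accuracy comes from the frequency weighting itself.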

If this is right

  • The adapted model captures robust and consistent features from cross-view images by leveraging VFM general representations.
  • Fusing frequency-enhanced local features with the CLS token improves image retrieval precision for geolocation.
  • The framework achieves state-of-the-art localization results on University-1652 and SUES-200 at low training cost.
  • The generalization capabilities of the VFM are utilized to handle viewpoint and scale variations without full retraining.
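
Retrieval precision in this setting is commonly reported as Recall@K. The sketch below shows one standard way to compute it from feature vectors using cosine similarity; the toy features and labels are synthetic placeholders, not data or results from the paper.

```python
import numpy as np


def recall_at_k(query_feats, gallery_feats, query_labels, gallery_labels, k=1):
    """Fraction of queries with at least one same-label item in the top-k results."""
    q = query_feats / np.linalg.norm(query_feats, axis=1, keepdims=True)
    g = gallery_feats / np.linalg.norm(gallery_feats, axis=1, keepdims=True)
    similarity = q @ g.T                               # cosine similarity
    top_k = np.argsort(-similarity, axis=1)[:, :k]     # best matches first
    hits = (gallery_labels[top_k] == query_labels[:, None]).any(axis=1)
    return float(hits.mean())


# Toy data: three locations, one gallery (satellite) vector each, and
# queries (drone) that are noisy copies of their own location's vector.
rng = np.random.default_rng(0)
gallery = rng.standard_normal((3, 32))
gallery_labels = np.array([0, 1, 2])
queries = gallery + 0.05 * rng.standard_normal((3, 32))
query_labels = np.array([0, 1, 2])

r1 = recall_at_k(queries, gallery, query_labels, gallery_labels, k=1)
```

The same function can score a plain backbone and an adapted one on identical query and gallery splits, so that any difference reflects the features and not the evaluation code.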

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • The frequency-domain handling in FASA could extend to other retrieval tasks where structural consistency across domains matters, such as medical or remote-sensing image matching.
  • Parameter-efficient adapters of this form might reduce data requirements for new cross-view problems by building on existing foundation model weights.
  • If the modules prove stable across datasets, the method could support real-time updates to geo-localization systems with minimal compute.
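
As a rough illustration of why adapters of this kind are cheap to train, the arithmetic below counts trainable parameters for a hypothetical bottleneck adapter with depthwise dilated branches. All sizes (width 768, bottleneck 64, 12 blocks, three 3x3 branches) and the frozen backbone size are placeholders chosen for the example, not figures from the paper.

```python
def linear_params(d_in, d_out):
    """Weights plus biases of a fully connected layer."""
    return d_in * d_out + d_out


def depthwise_conv_params(channels, kernel_size):
    """Weights plus biases of a depthwise convolution (one filter per channel).

    Dilation does not change this count: it spreads the same taps further apart.
    """
    return channels * kernel_size * kernel_size + channels


def adapter_params(width, bottleneck, num_branches, kernel_size=3):
    """Down-projection, parallel depthwise branches, up-projection."""
    down = linear_params(width, bottleneck)
    branches = num_branches * depthwise_conv_params(bottleneck, kernel_size)
    up = linear_params(bottleneck, width)
    return down + branches + up


# Placeholder sizes for illustration only.
width, bottleneck, num_blocks = 768, 64, 12
backbone_params = 86_000_000          # assumed frozen backbone size
per_block = adapter_params(width, bottleneck, num_branches=3)
trainable = num_blocks * per_block
trainable_fraction = trainable / (trainable + backbone_params)
```

Under these placeholder sizes the trainable share is on the order of one percent of the total, which is the sense in which such adapters are usually called parameter-efficient.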

Load-bearing premise

The MFEA and FASA modules will reliably bridge geometric gaps across arbitrary cross-view image pairs without introducing new artifacts or requiring dataset-specific hyperparameter tuning that raises training costs.

What would settle it

On a held-out cross-view dataset with larger scale or viewpoint shifts than University-1652, if BGG shows no accuracy gain over a plain VFM baseline while its training cost rises above the claimed low level, the bridging claim would not hold.

Figures

Figures reproduced from arXiv: 2605.10345 by Dou Quan, Licheng Jiao, Ning Huyan, Pei He, Shuang Wang, Wei Wang, Yi Li.

Figure 1: The feature maps of cross-view images captured by the Frozen ...
Figure 2: Overview of the proposed efficient adaptation framework of a vision foundation model for CVGL between drone and satellite imagery (BGG). BGG ...
Figure 3: Overview of the proposed frequency-aware structural aggregation ...
Figure 4: Visualization of cross-view feature maps on University-1652 for the pre-trained foundation model (DINOv3) under the Frozen and full-parameter ...
Figure 5: Top-5 retrieval results of the proposed method on the University-1652 dataset.
Figure 6: Top-5 retrieval results of the proposed method on the SUES-200 dataset.

Original abstract

Geometric differences between cross-view images, such as drone and satellite views, significantly increase the challenge of Cross-View Geo-Localization (CVGL), which aims to acquire the geolocation of images by image retrieval. To further enhance the CVGL performance, this paper proposes a parameter-efficient adaptation framework for bridging the geometric gap across images based on the vision foundation model (VFM) (e.g., DINOv3), termed BGG. BGG not only effectively leverages the general visual representations of VFM and captures the robust and consistent features from cross-view images, but also utilizes the generalization capabilities of the VFM, significantly improving the CVGL performance. It mainly contains a Multi-granularity Feature Enhancement Adapter (MFEA) and a Frequency-Aware Structural Aggregation (FASA) module. Specifically, MFEA enhances the scale adaptability and viewpoint robustness of features by multi-level dilated convolutions, effectively bridging the cross-view geometric gap with small training costs. Additionally, considering the [CLS] token lacks spatial details for precise image retrieval and localization, the FASA module modulates patch tokens in the frequency domain and performs adaptive aggregation for local structural feature enhancement. Finally, BGG fuses the enhanced local features with the [CLS] token for more accurate CVGL. Extensive experiments on University-1652 and SUES-200 datasets demonstrate that BGG has significant advantages over other methods and achieves state-of-the-art localization performance with low training costs.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes BGG, a parameter-efficient adaptation framework for cross-view geo-localization (CVGL) that adapts a frozen vision foundation model (DINOv3). It introduces the Multi-granularity Feature Enhancement Adapter (MFEA) using multi-level dilated convolutions to improve scale adaptability and viewpoint robustness, and the Frequency-Aware Structural Aggregation (FASA) module that modulates patch tokens in the frequency domain with adaptive aggregation to enhance local structural features. The enhanced local features are fused with the [CLS] token for image retrieval. Experiments on University-1652 and SUES-200 datasets are reported to achieve state-of-the-art performance with low training costs.

Significance. If the empirical gains hold under scrutiny and the MFEA/FASA modules generalize without dataset-specific retuning or frequency artifacts, the work would provide a practical demonstration of leveraging VFMs for geometric gap bridging in CVGL, potentially enabling more efficient adaptation with reduced compute while maintaining or improving retrieval accuracy.

major comments (2)
  1. [§3.2] FASA module description: The frequency-domain modulation and adaptive aggregation of patch tokens are asserted to enhance local structure without introducing misalignments or artifacts under extreme viewpoint/scale shifts, but no analysis, visualizations, or ablation on frequency parameter sensitivity is referenced to confirm this; this is load-bearing for the central claim that FASA reliably bridges geometric gaps.
  2. [§4] Experiments: The SOTA claims and the 'low training costs' and 'generalization capabilities' assertions rest on results from only University-1652 and SUES-200; no cross-dataset transfer experiments (e.g., training on one and testing on another without retuning MFEA/FASA hyperparameters) or checks for degradation on other cross-view pairs are described, weakening support for the generalization claim.
minor comments (2)
  1. [Abstract and §1] The phrases 'significantly improving the CVGL performance' and 'significant advantages' are repeated without immediate quantitative anchors; consider adding a brief reference to the reported metrics (e.g., recall@1 gains) for clarity.
  2. [§3] Notation: The description of MFEA's multi-level dilated convolutions and FASA's frequency modulation would benefit from an explicit equation or diagram label for the aggregation step to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for reviewing our manuscript and providing valuable feedback. We appreciate the referee's recognition of the potential of our BGG framework. We address the major comments point-by-point below, proposing revisions where necessary to strengthen the paper.

  1. Referee: [§3.2] FASA module description: The frequency-domain modulation and adaptive aggregation of patch tokens are asserted to enhance local structure without introducing misalignments or artifacts under extreme viewpoint/scale shifts, but no analysis, visualizations, or ablation on frequency parameter sensitivity is referenced to confirm this; this is load-bearing for the central claim that FASA reliably bridges geometric gaps.

    Authors: We thank the referee for highlighting this important aspect. While the empirical results on the benchmarks demonstrate the effectiveness of FASA in improving retrieval accuracy without apparent degradation from artifacts, we agree that explicit analysis would better support the claim. In the revised version, we will add: (1) visualizations showing the frequency spectra and reconstructed spatial features pre- and post-FASA to illustrate artifact-free enhancement; (2) an ablation study varying the frequency modulation parameters (e.g., low/high frequency emphasis) and reporting performance under controlled extreme scale and viewpoint variations. These additions are intended to confirm that FASA bridges geometric gaps reliably.

    revision: yes

  2. Referee: [§4] Experiments: The SOTA claims and the 'low training costs' and 'generalization capabilities' assertions rest on results from only University-1652 and SUES-200; no cross-dataset transfer experiments (e.g., training on one and testing on another without retuning MFEA/FASA hyperparameters) or checks for degradation on other cross-view pairs are described, weakening support for the generalization claim.

    Authors: We acknowledge that dedicated cross-dataset transfer experiments would provide more direct evidence for the generalization capabilities. Although University-1652 and SUES-200 represent distinct environments (one university campus with drone/satellite, the other suburban with varying altitudes), and our method achieves SOTA on both using identical hyperparameters, we will include in the revision: training on University-1652 and evaluating zero-shot on SUES-200 (and the reverse) without any retuning of MFEA or FASA. This will quantify any degradation and further validate the VFM's generalization in bridging geometric gaps across different cross-view pairs.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical module design with independent experimental validation

The paper presents a parameter-efficient adaptation framework (BGG) consisting of MFEA (multi-level dilated convolutions for scale/viewpoint robustness) and FASA (frequency-domain modulation and adaptive aggregation of patch tokens). No equations, derivations, or 'predictions' are defined that reduce by construction to fitted inputs or self-referential definitions. Central claims rest on empirical improvements reported on University-1652 and SUES-200 datasets rather than any self-citation chain or uniqueness theorem imported from the authors' prior work. The modules are described as novel designs leveraging a frozen VFM backbone (DINOv3), with no load-bearing step that renames a known result or smuggles an ansatz via citation. This is a standard honest non-finding for an applied CV adaptation paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that VFM representations remain useful after lightweight adaptation and that frequency-domain modulation preserves localization-relevant structure.

pith-pipeline@v0.9.0 · 5581 in / 1155 out tokens · 54204 ms · 2026-05-12T04:53:15.681382+00:00 · methodology


