Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

Chi-Nguyen Tran; Dao Sy Duy Minh; Huynh Trung Kiet; Long Tran-Thanh; Nguyen Lam Phu Quy; Phu-Hoa Pham

arxiv: 2605.11654 · v2 · pith:BK5HDFY3new · submitted 2026-05-12 · 💻 cs.CV · cs.AI · cs.RO

Pith reviewed 2026-05-20 23:01 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO

keywords cross-view geo-localization · prototype learning · semantic part discovery · weather robustness · vision transformers · drone navigation · multi-objective optimization · altitude invariance

The pith

Learnable prototypes group image patches to separate layout from texture for accurate drone-to-satellite matching that stays robust under weather changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes SkyPart as a lightweight add-on head for vision transformers that addresses three core problems in cross-view geo-localization: global descriptors that mix layout and texture, altitude scale effects retained in embeddings, and hand-tuned multi-loss training. It uses competing learnable prototypes assigned to patches by single-pass cosine similarity, applies altitude modulation only during training to create an altitude-free embedding at test time, adds graph-attention readout over the active prototypes, and replaces scalar weights with a Kendall uncertainty-weighted objective whose stationary points are Pareto-stationary. The resulting model is smaller than prior top performers yet reaches new state-of-the-art recall on SUES-200, University-1652, and DenseUAV under a strict single-pass protocol, with the gap widening on a ten-condition weather corruption benchmark.

Core claim

SkyPart institutes explicit part grouping over the patch grid of a vision transformer by letting a small set of learnable prototypes compete for patch tokens through single-pass cosine assignment, applies altitude-conditioned linear modulation exclusively during training so the final retrieval embedding is altitude-free at inference, routes the active prototypes through graph attention, and optimizes the combined objectives with Kendall uncertainty weighting to reach Pareto-stationary points, yielding higher single-pass retrieval accuracy than prior methods on standard benchmarks and a larger advantage under simulated weather corruptions.

What carries the argument

Learnable prototypes that compete for patch tokens via single-pass cosine assignment, paired with altitude-conditioned linear modulation applied only at training time.
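
The assignment step described above can be sketched in a few lines of plain Python. This is a minimal illustration, assuming a hard argmax assignment of each patch token to its most cosine-similar prototype; the function names (`cosine`, `assign_patches`), the toy 2-D vectors, and the choice of hard rather than soft assignment are all assumptions made for the example, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v + 1e-12)

def assign_patches(patches, prototypes):
    """Single pass over the patches: each one goes to its best-matching prototype.

    Returns (assignments, active): assignments[i] is the prototype index chosen
    for patch i, and active lists the prototypes that won at least one patch.
    Hard argmax is an assumption; a softmax over prototypes is equally plausible.
    """
    assignments = []
    for patch in patches:
        sims = [cosine(patch, proto) for proto in prototypes]
        assignments.append(max(range(len(prototypes)), key=lambda k: sims[k]))
    active = sorted(set(assignments))
    return assignments, active

# Toy data (hypothetical): 4 patch tokens and 3 prototypes in a 2-D feature space.
patches = [[1.0, 0.1], [0.9, 0.0], [0.0, 2.0], [0.1, 1.5]]
prototypes = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
assignments, active = assign_patches(patches, prototypes)
print(assignments, active)
```

In this toy case the third prototype wins no patches, so it would be excluded from any readout that operates only on active prototypes. Because the patches are visited once and no iterative refinement follows, the cost is one similarity computation per patch-prototype pair.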

If this is right

SkyPart reaches new state-of-the-art recall on SUES-200, University-1652, and DenseUAV using only single-pass retrieval without re-ranking or test-time augmentation.
The performance margin over the strongest baseline increases under the ten-condition WeatherPrompt corruption benchmark.
At 26.95 million parameters and 22.14 GFLOPs the model is the smallest among methods that achieve top-tier accuracy.
The final embedding is altitude-free at inference because modulation occurs only during training.
The Kendall uncertainty-weighted loss removes the need for hand-tuned scalar coefficients between incompatible gradient scales.
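
The last point can be illustrated numerically. The sketch below assumes the common log-variance form of uncertainty weighting from Kendall et al., in which each objective contributes `0.5 * exp(-s_i) * L_i + 0.5 * s_i` with a learnable scalar `s_i`; the paper's exact parameterization is not reproduced on this page, and the loss values are made-up numbers chosen only to sit on very different scales.

```python
import math

def kendall_total(losses, log_vars):
    """Uncertainty-weighted objective: sum of 0.5 * exp(-s_i) * L_i + 0.5 * s_i."""
    return sum(0.5 * math.exp(-s) * loss + 0.5 * s
               for loss, s in zip(losses, log_vars))

def weighted_terms(losses, log_vars):
    """Contribution of each objective after weighting (without the 0.5 * s_i penalty)."""
    return [0.5 * math.exp(-s) * loss for loss, s in zip(losses, log_vars)]

# Two objectives on incompatible scales (hypothetical values).
raw_losses = [4.0, 0.02]

# For fixed losses, setting the derivative in s_i to zero gives s_i = log(L_i).
best_log_vars = [math.log(loss) for loss in raw_losses]

terms = weighted_terms(raw_losses, best_log_vars)
print(terms)
print(kendall_total(raw_losses, best_log_vars), kendall_total(raw_losses, [0.0, 0.0]))
```

At the minimizing `s_i`, both weighted terms come out equal even though the raw losses differ by a factor of 200, which is the sense in which the learned scalars stand in for hand-tuned coefficients.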

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same prototype competition could be tested on other large viewpoint gaps such as ground-to-aerial retrieval where layout-texture separation is also required.
Because the prototypes are learned without explicit part labels, the method might generalize to unsupervised part discovery in other recognition tasks that suffer from domain shift.
The training-time altitude modulation trick could be applied to other scale-varying inputs such as varying camera distances in object detection.
A smaller model size combined with explicit robustness to weather corruptions lowers the barrier to deploying geo-localization on resource-limited drones.

Load-bearing premise

The prototypes assigned by cosine similarity actually isolate layout from texture across the drone-to-satellite view gap, and the training-only altitude modulation produces a truly altitude-invariant embedding at inference without any accuracy drop.
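
To make the second half of this premise concrete, here is one way a training-only modulation could be wired. It is a sketch under stated assumptions: FiLM is taken in its usual form `gamma(a) * x + beta(a)`, the scale and shift are generated by a hypothetical linear map of a normalized altitude value, and the inference path simply bypasses the modulation. How the paper actually generates `gamma` and `beta`, and where in the head the layer sits, is not specified on this page.

```python
def film(features, altitude, w_gamma, w_beta, training):
    """Feature-wise linear modulation conditioned on altitude, applied only in training.

    gamma = 1 + w_gamma * altitude and beta = w_beta * altitude are assumed
    (hypothetical) generators; at inference the altitude is never read.
    """
    if not training:
        return list(features)
    gamma = [1.0 + wg * altitude for wg in w_gamma]
    beta = [wb * altitude for wb in w_beta]
    return [g * x + b for g, x, b in zip(gamma, features, beta)]

features = [1.0, 2.0]
w_gamma, w_beta = [0.2, -0.1], [0.0, 0.3]  # hypothetical learned weights

train_out = film(features, 0.5, w_gamma, w_beta, training=True)
test_low = film(features, 0.2, w_gamma, w_beta, training=False)
test_high = film(features, 0.9, w_gamma, w_beta, training=False)
print(train_out, test_low, test_high)
```

The bypass makes the inference output trivially independent of altitude, but it does not by itself show that accuracy is preserved; that depends on whether the downstream layers learned to work with unmodulated features, which is exactly what the premise asks the experiments to establish.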

What would settle it

An ablation that removes the prototype assignment step and measures the resulting drop in recall specifically on the weather-corrupted test sets versus the clean sets would show whether the part separation mechanism is responsible for the reported robustness gain.
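
The comparison this ablation calls for reduces to computing Recall@K four times and differencing. The sketch below shows the bookkeeping, with the helper names (`recall_at_k`, `ablation_drops`) chosen for illustration and toy rankings standing in for real retrieval output; none of the numbers are results from the paper.

```python
def recall_at_k(ranked_ids, true_ids, k=1):
    """Fraction of queries whose true gallery id appears in the top-k ranking."""
    hits = sum(1 for ranked, true in zip(ranked_ids, true_ids) if true in ranked[:k])
    return hits / len(true_ids)

def ablation_drops(clean_full, clean_ablated, corrupt_full, corrupt_ablated):
    """Recall lost by removing a component, on clean and on corrupted queries."""
    return {
        "clean_drop": clean_full - clean_ablated,
        "corrupt_drop": corrupt_full - corrupt_ablated,
    }

# Toy rankings (hypothetical): three queries, gallery ids ranked best-first.
true_ids = [3, 0, 4]
full_model = [[3, 1], [2, 0], [5, 4]]
r1 = recall_at_k(full_model, true_ids, k=1)
r2 = recall_at_k(full_model, true_ids, k=2)
print(r1, r2)
```

If the part-separation mechanism is what drives robustness, `corrupt_drop` should come out clearly larger than `clean_drop` when the prototype assignment step is removed.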

Figures

Figures reproduced from arXiv: 2605.11654 by Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Long Tran-Thanh, Nguyen Lam Phu Quy, Phu-Hoa Pham.

**Figure 1.** SKYPART overview. A shared DINOv2 ViT-S/14 encodes drone and satellite views; three readouts (global CLS, semantic parts with K learnable prototypes under altitude-conditioned FiLM, and a prototype GAT for layout) are merged by a learned fusion gate into a 768-D ℓ2-normalised embedding, retrieved by cosine similarity in one pass (no re-ranking, no TTA). Bottom: training-only GEOPARTLOSS with four uncertain…

**Figure 2.** Part-level evidence under weather shifts. Rows show clean drone inputs, their part-level activations, paired satellite views, satellite part activations, weather-corrupted drone queries, and the corresponding part activations. Columns cover different corruptions and mixed weather conditions. Across substantial appearance changes, the part-discovery head continues to produce spatially structured activations…

**Figure 3.** Weather conditions. The same drone image under 10 WeatherPrompt augmentations. Texture is destroyed, but spatial structure persists, a pattern qualitatively aligned with layout-heavy representations and with SKYPART's relative robustness under environmental noise.

**Figure 4.** Weather robustness across three benchmarks (radar view). Per-condition Drone→Satellite R@1 (%) under the ten WeatherPrompt corruptions on SUES-200, University-1652, and DenseUAV. SKYPART (red, filled) maintains a near-circular profile, indicating uniform robustness across all conditions, while baselines collapse on hard regimes (F+S, Dark). Numerical breakdown matches …

**Figure 5.** Pareto efficiency across two benchmarks (D→S). R@1 vs. model size (params); bubble area ∝ GFLOPs. SKYPART (blue star) is Pareto-optimal on both SUES-200 (left) and University-1652 (right), using fewer parameters and substantially lower compute than every baseline. Single-pass 448×448; no re-ranking, no TTA.

**Figure 6.** Drone→Satellite top-5 retrieval. Each row is a drone query at a given altitude (row label), followed by the SKYPART part-attention heat map and the 5 highest-ranked satellite matches. Amber = correct, blue = incorrect.

**Figure 7.** Satellite→Drone top-5 retrieval. Each row is a satellite query, its SKYPART part-attention heat map, and the top-5 drone images SKYPART retrieves across altitudes. Amber = correct, blue = incorrect.

Original abstract

Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
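
The link between component (iv) and Pareto stationarity can be written out in two lines. The derivation below is a sketch that assumes the standard log-variance form of the Kendall objective, with $m$ task losses $\mathcal{L}_i$ sharing parameters $\theta$ and learnable scalars $s_i$; the symbols are introduced here for illustration, and the paper's own statement and proof may use a different parameterization.

```latex
% Assumed form of the uncertainty-weighted objective
\[
\mathcal{L}(\theta, s) \;=\; \sum_{i=1}^{m} \Big( \tfrac{1}{2} e^{-s_i}\, \mathcal{L}_i(\theta) + \tfrac{1}{2} s_i \Big)
\]

% Stationarity in the shared parameters
\[
\nabla_\theta \mathcal{L}(\theta, s) \;=\; \sum_{i=1}^{m} \tfrac{1}{2} e^{-s_i}\, \nabla_\theta \mathcal{L}_i(\theta) \;=\; 0
\]

% Normalizing the strictly positive weights gives a convex combination
\[
\alpha_i \;=\; \frac{e^{-s_i}}{\sum_{j=1}^{m} e^{-s_j}} \;>\; 0,
\qquad \sum_{i=1}^{m} \alpha_i = 1,
\qquad \sum_{i=1}^{m} \alpha_i\, \nabla_\theta \mathcal{L}_i(\theta) \;=\; 0
\]
```

The last condition is the usual definition of a Pareto-stationary point: a convex combination of the task gradients vanishes. Under the assumed form, every stationary point of the weighted objective therefore satisfies it, because the weights $e^{-s_i}$ are positive for any finite $s_i$.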

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

SkyPart puts together prototypes, training-only altitude modulation, and a weighted loss into a compact head that claims SOTA on CVGL benchmarks including weather shifts, but the layout-texture separation step needs checking.

The key point is that this work introduces SkyPart, a lightweight head for vision transformers in cross-view geo-localization. It groups patches using competing learnable prototypes via cosine similarity, modulates for altitude only at training time to get invariant embeddings, uses graph attention for readout, and applies a Kendall uncertainty-weighted loss for the multi-objective setup.

Referee Report

1 major / 1 minor

Summary. The paper proposes SkyPart, a lightweight swappable head for patch-based vision transformers in cross-view geo-localization. It uses four components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment to separate layout from texture, (ii) altitude-conditioned linear modulation applied only at training time to produce altitude-free embeddings at inference, (iii) graph-attention readout over active prototypes, and (iv) a Kendall uncertainty-weighted multi-objective loss with Pareto-stationary points. The method is reported to have 26.95M parameters and 22.14 GFLOPs, achieving new state-of-the-art results on SUES-200, University-1652, and DenseUAV under single-pass no-re-ranking no-TTA evaluation, with widening gains under a ten-condition WeatherPrompt corruption benchmark.
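
Component (iii) in this summary can be sketched as a single-head graph-attention layer followed by pooling. The example below is a simplified illustration, assuming a fully connected graph over the active prototypes, an identity projection in place of the usual learned weight matrix, and mean pooling as the readout; the names `gat_readout`, `attn_src`, and `attn_dst` are hypothetical, and the paper's actual graph construction and readout may differ.

```python
import math

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def gat_readout(nodes, attn_src, attn_dst):
    """One graph-attention pass over fully connected nodes, then mean pooling.

    Score for edge (i, j) is LeakyReLU(attn_src . h_i + attn_dst . h_j),
    normalized with a softmax over j. The learned projection W is omitted
    (treated as identity) to keep the sketch short.
    """
    n, dim = len(nodes), len(nodes[0])
    src = [sum(a * h for a, h in zip(attn_src, node)) for node in nodes]
    dst = [sum(a * h for a, h in zip(attn_dst, node)) for node in nodes]
    updated = []
    for i in range(n):
        alphas = softmax([leaky_relu(src[i] + dst[j]) for j in range(n)])
        updated.append([sum(alphas[j] * nodes[j][d] for j in range(n))
                        for d in range(dim)])
    return [sum(u[d] for u in updated) / n for d in range(dim)]

# Hypothetical prototype features; prototype 2 won no patches and is skipped.
prototype_feats = {0: [1.0, 0.0], 1: [0.0, 1.0], 2: [9.0, 9.0], 3: [1.0, 1.0]}
active = [0, 1, 3]
nodes = [prototype_feats[k] for k in active]

uniform = gat_readout(nodes, [0.0, 0.0], [0.0, 0.0])
attended = gat_readout(nodes, [0.5, -0.5], [1.0, 0.2])
print(uniform, attended)
```

With zero attention vectors every edge gets the same weight, so the readout collapses to the plain mean of the active prototypes; non-zero vectors reweight neighbours while keeping each output a convex combination of the active nodes, so the inactive prototype never leaks into the embedding.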

Significance. If the central claims hold, the work offers a practical, deployable advance for drone navigation under GNSS denial and weather variation by explicitly addressing layout-texture separation and altitude marginalization in a compact model. The explicit reporting of parameter count and GFLOPs, along with the multi-objective loss formulation referencing Pareto-stationarity, are strengths that support reproducibility and real-world applicability.

major comments (1)

[Abstract] Abstract, component (i): The claim that learnable prototypes via single-pass cosine assignment produce part groupings separating layout from texture across the view gap is load-bearing for the invariant embedding and reported gains, yet no auxiliary loss, orthogonality regularizer, or cross-view consistency term is described to enforce this separation. Without such a mechanism, assignment on raw patch tokens can cluster by low-level appearance or scale, which would invalidate the subsequent altitude modulation and graph readout contributions to the SOTA and weather-robustness results.

minor comments (1)

[Abstract] Abstract: The phrase 'theory-grounded components' is used for the four elements but the specific theoretical basis (e.g., for the single-pass assignment or graph readout) is not elaborated in the provided description.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on SkyPart. The concern regarding enforcement of layout-texture separation in the prototype assignment is well-taken, and we address it directly below with clarifications from the method design and planned revisions.

Referee: [Abstract] Abstract, component (i): The claim that learnable prototypes via single-pass cosine assignment produce part groupings separating layout from texture across the view gap is load-bearing for the invariant embedding and reported gains, yet no auxiliary loss, orthogonality regularizer, or cross-view consistency term is described to enforce this separation. Without such a mechanism, assignment on raw patch tokens can cluster by low-level appearance or scale, which would invalidate the subsequent altitude modulation and graph readout contributions to the SOTA and weather-robustness results.

Authors: The separation arises from the competitive single-pass cosine assignment of patch tokens to a set of learnable prototypes, optimized end-to-end under the Kendall-weighted multi-objective loss for cross-view matching. Because drone and satellite views differ primarily in texture and scale while sharing layout structure, the prototypes are driven to capture layout-invariant parts to minimize retrieval loss; low-level appearance clustering would increase the loss and is therefore disfavored during training. The graph-attention readout further reinforces this by operating only on active prototypes, producing embeddings that marginalize texture. We acknowledge that the original manuscript does not include an auxiliary loss or explicit visualizations to demonstrate the separation. In the revision we will add (i) qualitative prototype assignment maps on paired cross-view images and (ii) an ablation that replaces competitive assignment with random or k-means clustering, quantifying the drop in both clean and weather-corrupted accuracy. These additions will make the load-bearing claim explicit without altering the core method.

revision: partial

Circularity Check

0 steps flagged

No significant circularity detected in SkyPart derivation

The four components (learnable prototypes via single-pass cosine assignment, altitude-conditioned linear modulation, graph-attention readout, and Kendall uncertainty-weighted loss) are introduced as distinct architectural choices whose separation of layout from texture or Pareto-stationary property is not algebraically forced by the input patch tokens or loss definitions. The Pareto-stationary claim explicitly references external Kendall work rather than deriving it internally or fitting it to the target result. No self-citation chain, self-definitional loop, or fitted-input-renamed-as-prediction appears in the provided derivation; the SOTA and weather-robustness claims rest on empirical benchmark results rather than reducing to tautological inputs. The method is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The approach rests on standard computer-vision assumptions about patch token processing in ViTs and the utility of cosine similarity for assignment; it introduces no new physical constants or external benchmarks beyond named datasets.

axioms (1)

domain assumption: Vision transformers produce patch tokens that can be meaningfully grouped by semantic content across oblique and nadir views
Invoked by the prototype competition step described in the abstract.

invented entities (1)

SkyPart head with learnable prototypes · no independent evidence
purpose: To perform explicit part grouping over the patch grid and enable altitude-free embeddings
New module proposed by the paper; no independent evidence outside the reported experiments is supplied.

pith-pipeline@v0.9.0 · 5832 in / 1500 out tokens · 79758 ms · 2026-05-20T23:01:26.638140+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: learnable prototypes competing for patch tokens via a single-pass cosine assignment

IndisputableMonolith/Foundation/AlphaCoordinateFixation.lean · alpha_pin_under_high_calibration · unclear
Paper passage: altitude-conditioned linear modulation (FiLM) applied only during training

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 9 internal anchors

[1]

Emergence of invariance and disentanglement in deep representations

Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19 0 (50): 0 1--34, 2018. URL https://www.jmlr.org/papers/v19/17-646.html

work page 2018
[2]

NetVLAD : CNN architecture for weakly supervised place recognition

Relja Arandjelovi \'c , Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD : CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. URL https://openaccess.thecvf.com/content_cvpr_2016/html/Arandjelovic_NetVLAD_CNN_Architecture_CVPR_2016_paper.html

work page 2016
[3]

data2vec: A general framework for self-supervised learning in speech, vision and language

Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. doi:10.48550/arXiv.2202.03555

work page doi:10.48550/arxiv.2202.03555 2022
[4]

Recognition-by-components: A theory of human image understanding

Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94 0 (2): 0 115--147, 1987. doi:10.1037/0033-295X.94.2.115

work page doi:10.1037/0033-295x.94.2.115 1987
[5]

Walk in the cloud: Learning curves for point clouds shape analysis, pp

Mathilde Caron, Hugo Touvron, Ishan Misra, Herv \'e J \'e gou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650--9660, 2021. doi:10.1109/ICCV48922.2021.00951. URL https://openaccess.thecvf.com/content...

work page doi:10.1109/iccv48922.2021.00951 2021
[6]

SDPL : Shifting-dense partition learning for UAV -view geo-localization

Quan Chen, Tingyu Wang, Zihao Yang, Haoran Li, Rongfeng Lu, Yaoqi Sun, Bolun Zheng, and Chenggang Yan. SDPL : Shifting-dense partition learning for UAV -view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 34 0 (11): 0 11810--11824, 2024. doi:10.1109/TCSVT.2024.3424196

work page doi:10.1109/tcsvt.2024.3424196 2024
[7]

A simple framework for contrastive learning of visual representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 2020. URL https://proceedings.mlr.press/v119/chen20j.html

work page 2020
[8]

An empirical study of training self-supervised vision transformers

Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. URL https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_An_Empirical_Study_of_Training_Self-Supervised_Vision_Transformers_ICCV_2021_paper.pdf

work page 2021
[9]

Multilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization

Zhongwei Chen, Zhao-Xu Yang, and Hai-Jun Rong. Multilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization. IEEE Transactions on Geoscience and Remote Sensing, 63: 0 1--15, 2025. doi:10.1109/TGRS.2025.3572775

work page doi:10.1109/tgrs.2025.3572775 2025
[10]

Group equivariant convolutional networks

Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning (ICML), 2016. URL https://proceedings.mlr.press/v48/cohenc16.html

work page 2016
[11]

Akhloufi

Andy Couturier and Moulay A. Akhloufi. A review on absolute visual localization for UAV . Robotics and Autonomous Systems, 135: 0 103666, 2021. doi:10.1016/j.robot.2020.103666

work page doi:10.1016/j.robot.2020.103666 2021
[12]

A transformer-based feature segmentation and region alignment method for UAV -view geo-localization

Ming Dai, Jianhong Hu, Jiedong Zhuang, and Enhui Zheng. A transformer-based feature segmentation and region alignment method for UAV -view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 32 0 (7): 0 4376--4389, 2022. doi:10.1109/TCSVT.2021.3135013

work page doi:10.1109/tcsvt.2021.3135013 2022
[13]

Toward understanding asset flows in crypto money laundering through the lenses of ethereum heists

Ming Dai, Enhui Zheng, Zhenhua Feng, Lei Qi, Jiedong Zhuang, and Wankou Yang. Vision-based UAV self-positioning in low-altitude urban environments. IEEE Transactions on Image Processing, 33: 0 493--508, 2024. doi:10.1109/TIP.2023.3346279

work page doi:10.1109/tip.2023.3346279 2024
[14]

In: 2023 IEEE/CVF International Conference on Computer Vision (ICCV)

Fabian Deuser, Konrad Habel, and Norbert Oswald. Sample4Geo : Hard negative sampling for cross-view geo-localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16801--16810, 2023. doi:10.1109/ICCV51070.2023.01545

work page doi:10.1109/iccv51070.2023.01545 2023
[15]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representatio...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2010.11929 2021
[16]

CCR : A counterfactual causal reasoning-based method for cross-view geo-localization

Haolin Du, Jingfei He, and Yuanqing Zhao. CCR : A counterfactual causal reasoning-based method for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 34 0 (11): 0 11630--11643, 2024. doi:10.1109/TCSVT.2024.3425509

work page doi:10.1109/tcsvt.2024.3425509 2024
[17]

Dumoulin , author E

Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 2018. doi:10.23915/distill.00011

work page doi:10.23915/distill.00011 2018
[18]

Multi-weather cross-view geo-localization using denoising diffusion models

Tongtong Feng, Qing Li, Xin Wang, Mingzi Wang, Guangyao Li, and Wenwu Zhu. Multi-weather cross-view geo-localization using denoising diffusion models. In Proceedings of the 2nd Workshop on UAV s in Multimedia (UAVM) , pages 35--39, 2024. doi:10.1145/3689095.3689103

work page doi:10.1145/3689095.3689103 2024
[19]

Unsupervised Domain Adaptation by Backpropagation

Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015. doi:10.48550/arXiv.1409.7495

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1409.7495 2015
[20]

Semantic concept perception network with interactive prompting for cross-view image geo-localization

Yuan Gao, Haibo Liu, and Xiaohui Wei. Semantic concept perception network with interactive prompting for cross-view image geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 35 0 (6): 0 5343--5354, 2025. doi:10.1109/TCSVT.2025.3533574

work page doi:10.1109/tcsvt.2025.3533574 2025
[21]

Multilevel feedback joint representation learning network based on adaptive area elimination for cross-view geo-localization

Fawei Ge, Yunzhou Zhang, Li Wang, Wei Liu, Yixiu Liu, Sonya Coleman, and Dermot Kerr. Multilevel feedback joint representation learning network based on adaptive area elimination for cross-view geo-localization. IEEE Transactions on Geoscience and Remote Sensing, 62: 0 1--15, 2024. doi:10.1109/TGRS.2024.3396330

work page doi:10.1109/tgrs.2024.3396330 2024
[22]

and Tao, Dacheng , year=

Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129: 0 1789--1819, 2021. doi:10.1007/s11263-021-01453-z

work page doi:10.1007/s11263-021-01453-z 2021
[23]

Masked autoencoders are scalable vision learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll \'a r, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000--16009, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Le...

work page 2022
[24]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. doi:10.48550/arXiv.1503.02531. NIPS 2015 Deep Learning Workshop

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1503.02531 2015
[25]

MCFA : Multi-scale cascade and feature adaptive alignment network for cross-view geo-localization

Kaiji Hou, Qiang Tong, Na Yan, Xiulei Liu, and Shoulu Hou. MCFA : Multi-scale cascade and feature adaptive alignment network for cross-view geo-localization. Sensors, 25 0 (14): 0 4519, 2025. doi:10.3390/s25144519

work page doi:10.3390/s25144519 2025
[26]

Sixing Hu, Mengdan Feng, Rang M. H. Nguyen, and Gim Hee Lee. CVM -net: Cross-view matching network for image-based ground-to-aerial geo-localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_CVM-Net_Cross-View_Matching_CVPR_2018_paper.html

work page 2018
[27]

Multi-task learning using uncertainty to weigh losses for scene geometry and semantics

Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7482--7491, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Kendall_Multi-Task_Learning_Using_CVPR_2018_paper.html

work page 2018
[28]

Proxy anchor loss for deep metric learning

Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3238--3247, 2020. doi:10.48550/arXiv.2003.13911. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Kim_Proxy_Anchor_Loss_for_Deep_Metric_Learning_CVPR_202...

work page doi:10.48550/arxiv.2003.13911 2020
[29]

Semi-Supervised Classification with Graph Convolutional Networks

Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. doi:10.48550/arXiv.1609.02907

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1609.02907 2017
[30]

Alexander Lappe and Martin A. Giese. Register and [CLS] tokens induce a decoupling of local and global features in large ViTs . In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum?id=KhavyzO9kK

work page 2025
[31]

GeoFormer : An effective Transformer -based Siamese network for UAV geolocalization

Qingge Li, Xiaogang Yang, Jiwei Fan, Ruitao Lu, Bin Tang, Siyu Wang, and Shuang Su. GeoFormer : An effective Transformer -based Siamese network for UAV geolocalization. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17: 0 9470--9491, 2024. doi:10.1109/JSTARS.2024.3392812

work page doi:10.1109/jstars.2024.3392812 2024
[32]

A self-adaptive feature extraction method for aerial-view geo-localization

Jinliang Lin, Zhiming Luo, Dazhen Lin, Shaozi Li, and Zhun Zhong. A self-adaptive feature extraction method for aerial-view geo-localization. IEEE Transactions on Image Processing, 34: 0 126--139, 2025. doi:10.1109/TIP.2024.3513157

work page doi:10.1109/tip.2024.3513157 2025
[33]

Learning deep representations for ground-to-aerial geolocalization

Tsung-Yi Lin, Yin Cui, Serge Belongie, and James Hays. Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. URL https://openaccess.thecvf.com/content_cvpr_2015/html/Lin_Learning_Deep_Representations_for_CVPR_2015_paper.html

work page 2015
[34]

SeGCN : A semantic-aware graph convolutional network for UAV geo-localization

Xiangzeng Liu, Ziyao Wang, Yue Wu, and Qiguang Miao. SeGCN : A semantic-aware graph convolutional network for UAV geo-localization. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17: 0 6055--6066, 2024. doi:10.1109/JSTARS.2024.3370612

work page doi:10.1109/jstars.2024.3370612 2024
[35]

Object-centric learning with slot attention

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020. doi:10.48550/arXiv.2006.15055

work page doi:10.48550/arxiv.2006.15055 2020
[36]

SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation

Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In Proceedings of the International Conference on Machine Learning (ICML), 2023. doi:10.48550/arXiv.2211.14813
[37]

Let all be whitened: Multi-teacher distillation for efficient visual retrieval

Zhe Ma, Jianfeng Dong, Shouling Ji, Zhenguang Liu, Xuhong Zhang, Zonghui Wang, Sifeng He, Feng Qian, Xiaobo Zhang, and Lei Yang. Let all be whitened: Multi-teacher distillation for efficient visual retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5143--5151, 2024. URL https://ojs.aaai.org/index.php/AAAI/article...

[38]

Representation and recognition of the spatial organization of three-dimensional shapes

David Marr and H. Keith Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B, Biological Sciences, 200(1140):269–294, 1978. doi:10.1098/rspb.1978.0020
[39]

DINOv2: Learning robust visual features without supervision

Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. URL https://openreview.net/pdf?id=GLm1BA3C8p
[40]

Relational knowledge distillation

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3967--3976, 2019. URL https://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Relational_Knowledge_Distillation_CVPR_2019_paper.pdf

[41]

FiLM: Visual reasoning with a general conditioning layer

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. doi:10.1609/aaai.v32i1.11671
[42]

DINO-MSRA: A novel network architecture for cross-view image retrieval and localization of UAV and satellite images

Yifan Ping, Jun Lu, Haitao Guo, Qingfeng Hou, Kun Zhu, Zehao Sang, and Tong Liu. DINO-MSRA: A novel network architecture for cross-view image retrieval and localization of UAV and satellite images. Journal of Geo-information Science, 27(7):1608–1623, 2025. doi:10.12082/dqxxkx.2025.250051
[43]

Recent advances on jamming and spoofing detection in GNSS

Katarina Radoš, Marta Brkić, and Dinko Begušić. Recent advances on jamming and spoofing detection in GNSS. Sensors, 24(13):4210, 2024. doi:10.3390/s24134210
[44]

FaceNet: A unified embedding for face recognition and clustering

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. doi:10.1109/CVPR.2015.7298682. URL https://ieeexplore.ieee.org/document/7298682
[45]

Multi-Task Learning as Multi-Objective Optimization

Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018. doi:10.48550/arXiv.1810.04650

[46]

MCCG: A ConvNeXt-based multiple-classifier method for cross-view geo-localization

Tianrui Shen, Yingmei Wei, Lai Kang, Shanshan Wan, and Yee-Hong Yang. MCCG: A ConvNeXt-based multiple-classifier method for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 34(3):1456–1468, 2024. doi:10.1109/TCSVT.2023.3296074
[47]

Where am I looking at? Joint location and orientation estimation by cross-view matching

Yujiao Shi, Xin Yu, Dylan Campbell, and Hongdong Li. Where am I looking at? Joint location and orientation estimation by cross-view matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Shi_Where_Am_I_Looking_At_Joint_Location_and_Orientation_E...
[48]

TirSA: A three stage approach for UAV-satellite cross-view geo-localization based on self-supervised feature enhancement

Jian Sun, Hao Sun, Lin Lei, Kefeng Ji, and Gangyao Kuang. TirSA: A three stage approach for UAV-satellite cross-view geo-localization based on self-supervised feature enhancement. IEEE Transactions on Circuits and Systems for Video Technology, 34(9):7882–7895, 2024. doi:10.1109/TCSVT.2024.3382717
[49]

Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)

Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/html/Yifan_Sun_Beyond_Part_Models_ECCV_2018_paper.html

[50]

Circle loss: A unified perspective of pair similarity optimization

Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. doi:10.1109/CVPR42600.2020.00643. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Sun_...
[51]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS), 2017. URL https://proceedings.neurips.cc/paper/2017/hash/5a61e2356a4a14f2a8c4e1a4c4c7e26a-Abstract.html

[52]

Contrastive representation distillation

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/pdf?id=SkgpBJrtvS

[53]

Representation Learning with Contrastive Predictive Coding

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. In Advances in Neural Information Processing Systems (NeurIPS), 2018. doi:10.48550/arXiv.1807.03748
[54]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html
[55]

Graph Attention Networks

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. doi:10.48550/arXiv.1710.10903. URL https://openreview.net/forum?id=rJXMpikCZ
[56]

Tent: Fully Test-time Adaptation by Entropy Minimization

Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. doi:10.48550/arXiv.2006.10726

[57]

Each part matters: Local patterns facilitate cross-view geo-localization

Tingyu Wang, Zhedong Zheng, Chenggang Yan, Jiyong Zhang, Yaoqi Sun, Bolun Zheng, and Yi Yang. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(2):867–879, 2022. doi:10.1109/TCSVT.2021.3061265
[58]

Multiple-environment self-adaptive network for aerial-view geo-localization

Tingyu Wang, Zhedong Zheng, Yaoqi Sun, Chenggang Yan, Yi Yang, and Tat-Seng Chua. Multiple-environment self-adaptive network for aerial-view geo-localization. Pattern Recognition, 152:110363, 2024. doi:10.1016/j.patcog.2024.110363
[59]

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning (ICML), 2020. doi:10.48550/arXiv.2005.10242

[60]

Weatherprompt: Multi-modality representation learning for all-weather drone visual geo-localization

Jiahao Wen, Hang Yu, and Zhedong Zheng. Weatherprompt: Multi-modality representation learning for all-weather drone visual geo-localization. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://nips.cc/virtual/2025/poster/118002

[61]

Wide-area image geolocalization with aerial reference imagery

Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. URL https://openaccess.thecvf.com/content_iccv_2015/html/Workman_Wide-Area_Image_Geolocalization_ICCV_2015_paper.html

[62]

CAMP: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning

Qiong Wu, Yi Wan, Zhi Zheng, Yongjun Zhang, Guangshuai Wang, and Zhenyang Zhao. CAMP: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning. IEEE Transactions on Geoscience and Remote Sensing, 62:1–14, 2024. doi:10.1109/TGRS.2024.3448499
[63]

Enhancing cross-view geo-localization with domain alignment and scene consistency

Panwang Xia, Yi Wan, Zhi Zheng, Yongjun Zhang, and Jiwei Deng. Enhancing cross-view geo-localization with domain alignment and scene consistency. IEEE Transactions on Circuits and Systems for Video Technology, 34(12):13271–13281, 2024. doi:10.1109/TCSVT.2024.3443510
[64]

Enhancing cross view geo localization through global local quadrant interaction network

Jin Xu, Junping Yin, Juan Zhang, and Tianyan Gao. Enhancing cross view geo localization through global local quadrant interaction network. Scientific Reports, 15:33431, 2025a. doi:10.1038/s41598-025-18935-6
[65]

Precise GPS-denied UAV self-positioning via context-enhanced cross-view geo-localization

Yuanze Xu, Ming Dai, Wenxiao Cai, and Wankou Yang. Precise GPS-denied UAV self-positioning via context-enhanced cross-view geo-localization. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 374–388. Springer, 2025b. doi:10.1007/978-981-95-5628-1_26
[66]

DINOv2-based UAV visual self-localization in low-altitude urban environments

Jiaqiang Yang, Danyang Qin, Huapeng Tang, Sili Tao, Haoze Bie, and Lin Ma. DINOv2-based UAV visual self-localization in low-altitude urban environments. IEEE Robotics and Automation Letters, 10(2):2080–2087, 2025. doi:10.1109/LRA.2025.3527762
[67]

Exploring the best way for UAV visual localization under Low-altitude Multi-view Observation Condition: a Benchmark

Yibin Ye, Xichao Teng, Shuo Chen, Zhang Li, Leqi Liu, Qifeng Yu, and Tao Tan. Exploring the best way for UAV visual localization under low-altitude multi-view observation condition: A benchmark. In Findings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. doi:10.48550/arXiv.2503.10692

[68]

University-1652: A multi-view multi-source benchmark for drone-based geo-localization

Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), pages 1395–1403, 2020. doi:10.1145/3394171.3413896
[69]

iBOT: Image BERT pre-training with online tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/pdf?id=ydopy-e6Dg
[70]

SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite

Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Transactions on Circuits and Systems for Video Technology, 33(9):4825–4839, 2023. doi:10.1109/TCSVT.2023.3249204

[31] [31]

GeoFormer : An effective Transformer -based Siamese network for UAV geolocalization

Qingge Li, Xiaogang Yang, Jiwei Fan, Ruitao Lu, Bin Tang, Siyu Wang, and Shuang Su. GeoFormer : An effective Transformer -based Siamese network for UAV geolocalization. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17: 0 9470--9491, 2024. doi:10.1109/JSTARS.2024.3392812

work page doi:10.1109/jstars.2024.3392812 2024

[32] [32]

A self-adaptive feature extraction method for aerial-view geo-localization

Jinliang Lin, Zhiming Luo, Dazhen Lin, Shaozi Li, and Zhun Zhong. A self-adaptive feature extraction method for aerial-view geo-localization. IEEE Transactions on Image Processing, 34: 0 126--139, 2025. doi:10.1109/TIP.2024.3513157

work page doi:10.1109/tip.2024.3513157 2025

[33] [33]

Learning deep representations for ground-to-aerial geolocalization

Tsung-Yi Lin, Yin Cui, Serge Belongie, and James Hays. Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. URL https://openaccess.thecvf.com/content_cvpr_2015/html/Lin_Learning_Deep_Representations_for_CVPR_2015_paper.html

work page 2015

[34] [34]

SeGCN : A semantic-aware graph convolutional network for UAV geo-localization

Xiangzeng Liu, Ziyao Wang, Yue Wu, and Qiguang Miao. SeGCN : A semantic-aware graph convolutional network for UAV geo-localization. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17: 0 6055--6066, 2024. doi:10.1109/JSTARS.2024.3370612

work page doi:10.1109/jstars.2024.3370612 2024

[35] [35]

Object-centric learning with slot attention

Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020. doi:10.48550/arXiv.2006.15055

work page doi:10.48550/arxiv.2006.15055 2020

[36] [36]

SegCLIP : Patch aggregation with learnable centers for open-vocabulary semantic segmentation

Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. SegCLIP : Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In Proceedings of the International Conference on Machine Learning (ICML), 2023. doi:10.48550/arXiv.2211.14813

work page doi:10.48550/arxiv.2211.14813 2023

[37] [37]

Let all be whitened: Multi-teacher distillation for efficient visual retrieval

Zhe Ma, Jianfeng Dong, Shouling Ji, Zhenguang Liu, Xuhong Zhang, Zonghui Wang, Sifeng He, Feng Qian, Xiaobo Zhang, and Lei Yang. Let all be whitened: Multi-teacher distillation for efficient visual retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5143--5151, 2024. URL https://ojs.aaai.org/index.php/AAAI/article...

work page 2024

[38] [38]

Keith Nishihara

David Marr and H. Keith Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B, Biological Sciences, 200 0 (1140): 0 269--294, 1978. doi:10.1098/rspb.1978.0020

work page doi:10.1098/rspb.1978.0020 1978

[39] [39]

DINOv2 : Learning robust visual features without supervision

Maxime Oquab, Timoth \'e e Darcet, Th \'e o Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2 : Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. URL https://openreview.net/pdf?id=GLm1BA3C8p

work page 2024

[40] [40]

Relational knowledge distillation

Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3967--3976, 2019. URL https://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Relational_Knowledge_Distillation_CVPR_2019_paper.pdf

work page 2019

[41] [41]

Perez , author F

Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM : Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. doi:10.1609/aaai.v32i1.11671

work page doi:10.1609/aaai.v32i1.11671 2018

[42] [42]

DINO-MSRA : A novel network architecture for cross-view image retrieval and localization of UAV and satellite images

Yifan Ping, Jun Lu, Haitao Guo, Qingfeng Hou, Kun Zhu, Zehao Sang, and Tong Liu. DINO-MSRA : A novel network architecture for cross-view image retrieval and localization of UAV and satellite images. Journal of Geo-information Science, 27 0 (7): 0 1608--1623, 2025. doi:10.12082/dqxxkx.2025.250051

work page doi:10.12082/dqxxkx.2025.250051 2025

[43] [43]

Recent advances on jamming and spoofing detection in GNSS

Katarina Rado s , Marta Brki \'c , and Dinko Begu s i \'c . Recent advances on jamming and spoofing detection in GNSS . Sensors, 24 0 (13): 0 4210, 2024. doi:10.3390/s24134210

work page doi:10.3390/s24134210 2024

[44] [44]

Schroff, D

Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet : A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. doi:10.1109/CVPR.2015.7298682. URL https://ieeexplore.ieee.org/document/7298682

work page doi:10.1109/cvpr.2015.7298682 2015

[45] [45]

Multi-Task Learning as Multi-Objective Optimization

Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018. doi:10.48550/arXiv.1810.04650

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.1810.04650 2018

[46] [46]

MCCG : A ConvNeXt -based multiple-classifier method for cross-view geo-localization

Tianrui Shen, Yingmei Wei, Lai Kang, Shanshan Wan, and Yee-Hong Yang. MCCG : A ConvNeXt -based multiple-classifier method for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 34 0 (3): 0 1456--1468, 2024. doi:10.1109/TCSVT.2023.3296074

work page doi:10.1109/tcsvt.2023.3296074 2024

[47] [47]

Where am I looking at? J oint location and orientation estimation by cross-view matching

Yujiao Shi, Xin Yu, Dylan Campbell, and Hongdong Li. Where am I looking at? J oint location and orientation estimation by cross-view matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Shi_Where_Am_I_Looking_At_Joint_Location_and_Orientation_E...

[48]

TirSA: A three stage approach for UAV-satellite cross-view geo-localization based on self-supervised feature enhancement

Jian Sun, Hao Sun, Lin Lei, Kefeng Ji, and Gangyao Kuang. TirSA: A three stage approach for UAV-satellite cross-view geo-localization based on self-supervised feature enhancement. IEEE Transactions on Circuits and Systems for Video Technology, 34(9): 7882--7895, 2024. doi:10.1109/TCSVT.2024.3382717

[49]

Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)

Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/html/Yifan_Sun_Beyond_Part_Models_ECCV_2018_paper.html

[50]

Circle loss: A unified perspective of pair similarity optimization

Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. doi:10.1109/CVPR42600.2020.00643. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Sun_...

[51]

Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS), 2017. URL https://proceedings.neurips.cc/paper/2017/hash/5a61e2356a4a14f2a8c4e1a4c4c7e26a-Abstract.html

[52]

Contrastive representation distillation

Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/pdf?id=SkgpBJrtvS

[53]

Representation Learning with Contrastive Predictive Coding

Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. In Advances in Neural Information Processing Systems (NeurIPS), 2018. doi:10.48550/arXiv.1807.03748

[54]

Attention is all you need

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

[55]

Graph Attention Networks

Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. doi:10.48550/arXiv.1710.10903. URL https://openreview.net/forum?id=rJXMpikCZ

[56]

Tent: Fully Test-time Adaptation by Entropy Minimization

Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. doi:10.48550/arXiv.2006.10726

[57]

Each part matters: Local patterns facilitate cross-view geo-localization

Tingyu Wang, Zhedong Zheng, Chenggang Yan, Jiyong Zhang, Yaoqi Sun, Bolun Zheng, and Yi Yang. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(2): 867--879, 2022. doi:10.1109/TCSVT.2021.3061265

[58]

Multiple-environment self-adaptive network for aerial-view geo-localization

Tingyu Wang, Zhedong Zheng, Yaoqi Sun, Chenggang Yan, Yi Yang, and Tat-Seng Chua. Multiple-environment self-adaptive network for aerial-view geo-localization. Pattern Recognition, 152: 110363, 2024. doi:10.1016/j.patcog.2024.110363

[59]

Understanding contrastive representation learning through alignment and uniformity on the hypersphere

Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning (ICML), 2020. doi:10.48550/arXiv.2005.10242

[60]

Weatherprompt: Multi-modality representation learning for all-weather drone visual geo-localization

Jiahao Wen, Hang Yu, and Zhedong Zheng. Weatherprompt: Multi-modality representation learning for all-weather drone visual geo-localization. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://nips.cc/virtual/2025/poster/118002

[61]

Wide-area image geolocalization with aerial reference imagery

Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. URL https://openaccess.thecvf.com/content_iccv_2015/html/Workman_Wide-Area_Image_Geolocalization_ICCV_2015_paper.html

[62]

CAMP: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning

Qiong Wu, Yi Wan, Zhi Zheng, Yongjun Zhang, Guangshuai Wang, and Zhenyang Zhao. CAMP: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning. IEEE Transactions on Geoscience and Remote Sensing, 62: 1--14, 2024. doi:10.1109/TGRS.2024.3448499

[63]

Enhancing cross-view geo-localization with domain alignment and scene consistency

Panwang Xia, Yi Wan, Zhi Zheng, Yongjun Zhang, and Jiwei Deng. Enhancing cross-view geo-localization with domain alignment and scene consistency. IEEE Transactions on Circuits and Systems for Video Technology, 34(12): 13271--13281, 2024. doi:10.1109/TCSVT.2024.3443510

[64]

Enhancing cross view geo localization through global local quadrant interaction network

Jin Xu, Junping Yin, Juan Zhang, and Tianyan Gao. Enhancing cross view geo localization through global local quadrant interaction network. Scientific Reports, 15: 33431, 2025a. doi:10.1038/s41598-025-18935-6

[65]

Precise GPS-denied UAV self-positioning via context-enhanced cross-view geo-localization

Yuanze Xu, Ming Dai, Wenxiao Cai, and Wankou Yang. Precise GPS-denied UAV self-positioning via context-enhanced cross-view geo-localization. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 374--388. Springer, 2025b. doi:10.1007/978-981-95-5628-1_26

[66]

DINOv2-based UAV visual self-localization in low-altitude urban environments

Jiaqiang Yang, Danyang Qin, Huapeng Tang, Sili Tao, Haoze Bie, and Lin Ma. DINOv2-based UAV visual self-localization in low-altitude urban environments. IEEE Robotics and Automation Letters, 10(2): 2080--2087, 2025. doi:10.1109/LRA.2025.3527762

[67]

Exploring the best way for UAV visual localization under low-altitude multi-view observation condition: A benchmark

Yibin Ye, Xichao Teng, Shuo Chen, Zhang Li, Leqi Liu, Qifeng Yu, and Tao Tan. Exploring the best way for UAV visual localization under low-altitude multi-view observation condition: A benchmark. In Findings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. doi:10.48550/arXiv.2503.10692

[68]

University-1652: A multi-view multi-source benchmark for drone-based geo-localization

Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), pages 1395--1403, 2020. doi:10.1145/3394171.3413896

[69]

iBOT: Image BERT pre-training with online tokenizer

Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/pdf?id=ydopy-e6Dg

[70]

SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite

Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Transactions on Circuits and Systems for Video Technology, 33(9): 4825--4839, 2023. doi:10.1109/TCSVT.2023.3249204
