pith. machine review for the scientific record.

arxiv: 2605.11654 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI · cs.RO

Recognition: 2 theorem links · Lean Theorem

Weather-Robust Cross-View Geo-Localization via Prototype-Based Semantic Part Discovery

Pith reviewed 2026-05-13 01:26 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.RO
keywords cross-view geo-localization · semantic part discovery · prototype learning · drone navigation · weather robustness · vision transformers · multi-objective optimization

The pith

SkyPart discovers semantic parts in drone and satellite images using competing learnable prototypes to match views despite weather and altitude changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents SkyPart, a swappable head for patch-based vision transformers that groups image patches into semantic parts for cross-view geo-localization. Existing global-descriptor approaches mix layout and texture across the drastic view gap between oblique drone shots and overhead satellite tiles, retain altitude-related scale variation in the final embedding, and require manual loss balancing. SkyPart counters these with single-pass cosine assignment of patches to learnable prototypes, altitude-conditioned modulation used only in training, a graph-attention readout, and an uncertainty-weighted multi-objective loss. A reader would care because the resulting 26.95M-parameter model sets a new state of the art on three standard benchmarks in a single forward pass, and its lead over the strongest baseline widens when weather corruptions are added.

Core claim

SkyPart institutes explicit part grouping over the patch grid with four components: learnable prototypes that compete for patch tokens via single-pass cosine assignment, altitude-conditioned linear modulation applied only during training to produce altitude-free retrieval embeddings at inference, graph-attention readout over active prototypes, and a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs it is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under single-pass, no-re-ranking, no-TTA evaluation. Its margin over the strongest prior m

What carries the argument

Learnable prototypes that perform single-pass cosine assignment to group patch tokens into semantic parts that separate layout from texture across view gaps.
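A minimal sketch of what single-pass cosine assignment could look like, written in plain Python for readability. The abstract states only that prototypes compete for patch tokens via cosine similarity in one pass; the softmax over prototypes, the `temperature` value, and the assignment-weighted mean used to pool each part are assumptions made here for illustration, not the authors' implementation.

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def assign_patches(patches, prototypes, temperature=0.1):
    """Soft-assign every patch token to K prototypes in one pass.

    Returns (assign, parts): assign[n][k] is the weight of patch n on
    prototype k, and parts[k] is the assignment-weighted mean of the patches.
    `temperature` and the pooling rule are illustrative assumptions.
    """
    protos = [l2_normalize(p) for p in prototypes]
    assign = []
    for tok in patches:
        t = l2_normalize(tok)
        sims = [sum(a * b for a, b in zip(t, p)) for p in protos]  # cosine similarity
        # Softmax across prototypes: prototypes compete for each patch.
        assign.append(softmax([s / temperature for s in sims]))
    dim = len(patches[0])
    parts = []
    for k in range(len(protos)):
        mass = max(sum(row[k] for row in assign), 1e-8)
        parts.append([
            sum(row[k] * tok[d] for row, tok in zip(assign, patches)) / mass
            for d in range(dim)
        ])
    return assign, parts
```

Because there is no iterative refinement (unlike slot-attention-style grouping), the cost is a single similarity matrix between patches and prototypes.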

If this is right

  • Accuracy leads widen on SUES-200, University-1652, and DenseUAV when weather corruptions are introduced.
  • The model runs with lower compute than prior top methods while requiring no re-ranking or test-time augmentation.
  • Altitude scale is removed from the embedding without any altitude input needed at inference time.
  • Multi-objective training reaches Pareto-stationary points without hand-tuned loss scalars.
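The last point refers to Kendall-style uncertainty weighting. As a hedged sketch, the commonly used form of that loss gives each objective a learnable log-variance `s_i` and sums `0.5 * exp(-s_i) * L_i + 0.5 * s_i`. Whether SkyPart uses exactly this parameterisation, and how its four objectives are defined, is not stated in the abstract; the function name below is hypothetical.

```python
import math

def uncertainty_weighted_loss(losses, log_vars):
    """Combine per-objective losses with learnable log-variances.

    Each term is 0.5 * exp(-s) * loss + 0.5 * s, where s = log(sigma^2).
    A large s down-weights its loss but pays a regularisation penalty.
    """
    total = 0.0
    for loss, s in zip(losses, log_vars):
        total += 0.5 * math.exp(-s) * loss + 0.5 * s
    return total
```

For a fixed loss value `L`, the term is minimised at `s = log(L)`, so the effective weight settles near `1/L`. That is the mechanism by which losses on mismatched scales are balanced without hand-tuned scalars.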

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The prototype assignment mechanism could be tested on ground-to-satellite localization tasks that share similar layout-texture separation needs.
  • Replacing fixed prototype count with a learned or dynamic number might handle scenes of varying complexity without retraining.
  • The uncertainty-weighted loss could transfer to other vision tasks that combine objectives with mismatched gradient magnitudes.

Load-bearing premise

Single-pass cosine assignment of patches to learnable prototypes will reliably discover semantic parts that separate layout from texture across the view gap, and altitude modulation used only in training will produce an altitude-invariant embedding at inference without losing discriminative power.
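To make the second half of this premise concrete, here is a small sketch of feature-wise linear modulation (FiLM) that is active only in training. The abstract does not say how the modulation is removed at inference; this sketch assumes the modulated branch is simply skipped. The conditioning function `altitude_film_params` and its coefficients are hypothetical stand-ins for a learned network.

```python
def altitude_film_params(altitude_m, dim, scale_w=0.001, shift_w=0.0005):
    """Hypothetical conditioning map from altitude (metres) to (gamma, beta).

    A real model would learn this mapping; fixed coefficients are used here
    only so the example is deterministic.
    """
    gamma = [1.0 + scale_w * altitude_m] * dim
    beta = [shift_w * altitude_m] * dim
    return gamma, beta

def part_features(tokens, altitude_m=None, training=False):
    """Apply gamma * x + beta during training; pass tokens through at inference."""
    if training and altitude_m is not None:
        gamma, beta = altitude_film_params(altitude_m, len(tokens))
        return [g * x + b for x, g, b in zip(tokens, gamma, beta)]
    # Inference path: altitude is never consulted, so the output cannot depend on it.
    return list(tokens)
```

Under this assumption the inference embedding is altitude-free by construction, but nothing in the sketch guarantees it stays discriminative. That is the part of the premise that needs empirical support.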

What would settle it

Retrieval accuracy falling below the strongest baseline on a held-out set of drone-satellite pairs captured under previously unseen extreme weather or altitude combinations where prototype assignments collapse layout and texture.

Figures

Figures reproduced from arXiv: 2605.11654 by Chi-Nguyen Tran, Dao Sy Duy Minh, Huynh Trung Kiet, Long Tran-Thanh, Nguyen Lam Phu Quy, Phu-Hoa Pham.

Figure 1: SKYPART overview. A shared DINOv2 ViT-S/14 encodes drone and satellite views; three readouts (global CLS, semantic parts with K learnable prototypes under altitude-conditioned FiLM, and a prototype GAT for layout) are merged by a learned fusion gate into a 768-D ℓ2-normalised embedding, retrieved by cosine similarity in one pass (no re-ranking, no TTA). Bottom: training-only GEOPARTLOSS with four uncertain…
Figure 2: Part-level evidence under weather shifts. Rows show clean drone inputs, their part-level activations, paired satellite views, satellite part activations, weather-corrupted drone queries, and the corresponding part activations. Columns cover different corruptions and mixed weather conditions. Across substantial appearance changes, the part-discovery head continues to produce spatially structured activations…
Figure 3: Weather conditions. The same drone image under 10 WeatherPrompt augmentations. Texture is destroyed, but spatial structure persists, a pattern qualitatively aligned with layout-heavy representations and with SKYPART's relative robustness under environmental noise.
A3.4.1 Evaluation Protocol. The evaluation protocol follows the WeatherPrompt guidelines: the satellite gallery remains clean while drone queries …
Figure 4: Weather robustness across three benchmarks (radar view). Per-condition Drone→Satellite R@1 (%) under the ten WeatherPrompt corruptions on SUES-200, University-1652, and DenseUAV. SKYPART (red, filled) maintains a near-circular profile, indicating uniform robustness across all conditions, while baselines collapse on hard regimes (F+S, Dark). Numerical breakdown matches …
Figure 5: Pareto efficiency across two benchmarks (D→S). R@1 vs. model size (params); bubble area ∝ GFLOPs. SKYPART (blue star) is Pareto-optimal on both SUES-200 (left) and University-1652 (right), using fewer parameters and substantially lower compute than every baseline. Single-pass 448×448; no re-ranking, no TTA.
A4.2 Limitations and Broader Impact. Our train/test splits share a geographic region; cross-city or c…
Figure 6: Drone→Satellite top-5 retrieval. Each row is a drone query at a given altitude (row label), followed by the SKYPART part-attention heat map and the 5 highest-ranked satellite matches. Amber = correct, blue = incorrect.
Geometric and transport priors. Polar warping [Shi et al., 2020] is the standard preprocessing for ground-panorama geometry, but on aerial tiles the reprojection is wrong and the train/test …
Figure 7: Satellite→Drone top-5 retrieval. Each row is a satellite query, its SKYPART part-attention heat map, and the top-5 drone images SKYPART retrieves across altitudes. Amber = correct, blue = incorrect.
… numbers because they measure something different from the embedding itself. Each added 1–4 pp on at least one benchmark; a proper evaluation of how they compound with SKYPART is left for future work.
Original abstract

Cross-view geo-localization (CVGL), which matches an oblique drone view to a geo-referenced satellite tile, has emerged as a key alternative for autonomous drone navigation when GNSS signals are jammed, spoofed, or unavailable. Despite strong recent progress, three limitations persist: (1) global-descriptor designs compress the patch grid into a single vector without separating layout from texture across the view gap; (2) altitude-related scale variation is retained in the learned embedding rather than marginalized; and (3) multi-objective training relies on hand-tuned scalars over losses on incompatible gradient scales. We propose SkyPart, a lightweight swappable head for patch-based vision transformers (ViTs) that institutes explicit part grouping over the patch grid. SkyPart has four theory-grounded components: (i) learnable prototypes competing for patch tokens via single-pass cosine assignment; (ii) altitude-conditioned linear modulation applied only during training, making the retrieval embedding altitude-free at inference; (iii) a graph-attention readout over active prototypes; and (iv) a Kendall uncertainty-weighted multi-objective loss whose stationary points are Pareto-stationary. At 26.95M parameters and 22.14 GFLOPs, SkyPart is the smallest among top-performing methods and sets a new state of the art on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol. Its advantage over the strongest baseline widens under the ten-condition WeatherPrompt corruption benchmark.
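The single-pass, no-re-ranking protocol mentioned in the abstract amounts to ranking the gallery once by cosine similarity and checking the top hit. A minimal sketch of Recall@1 under that reading follows; the function names and the toy identifiers are illustrative, and real benchmarks such as University-1652 define their own query/gallery splits and matching rules.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors (0.0 if either has zero norm)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0

def recall_at_1(query_embs, query_ids, gallery_embs, gallery_ids):
    """Fraction of queries whose single nearest gallery item shares their id.

    One similarity pass, no re-ranking, no test-time augmentation.
    """
    hits = 0
    for q, qid in zip(query_embs, query_ids):
        best = max(range(len(gallery_embs)), key=lambda g: cosine(q, gallery_embs[g]))
        hits += gallery_ids[best] == qid
    return hits / len(query_embs)
```

Usage with toy two-dimensional embeddings: `recall_at_1([[1, 0.1], [0.1, 1]], ["a", "b"], [[1, 0], [0, 1]], ["a", "b"])` evaluates both queries against a two-item gallery.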

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes SkyPart, a lightweight swappable head for patch-based ViTs in cross-view geo-localization. It introduces four components: (i) learnable prototypes assigned to patch tokens via single-pass cosine similarity, (ii) altitude-conditioned linear modulation applied only at training time, (iii) graph-attention readout over active prototypes, and (iv) a Kendall uncertainty-weighted multi-objective loss. The method claims SOTA results on SUES-200, University-1652, and DenseUAV under a single-pass, no-re-ranking, no-TTA protocol, with 26.95M parameters and 22.14 GFLOPs (smallest among top methods), plus a widening advantage over baselines on a ten-condition WeatherPrompt corruption benchmark.
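For readers unfamiliar with component (iii), the following is a rough sketch of a single-head graph-attention readout over "active" prototypes. It follows the standard GAT scoring rule, LeakyReLU of a learned linear function of the two node features, but uses an identity feature transform, a fully connected graph, a mass threshold `min_mass` to decide which prototypes are active, and mean pooling. All of those choices are assumptions for illustration; the paper's actual readout may differ.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def leaky_relu(x, slope=0.2):
    return x if x > 0 else slope * x

def gat_readout(parts, mass, attn_dst, attn_src, min_mass=1e-3):
    """Attend among active prototypes, then mean-pool into one vector.

    parts[k] is the descriptor of prototype k and mass[k] its total
    assignment weight; prototypes with mass <= min_mass are ignored.
    attn_dst / attn_src play the role of the two halves of GAT's
    attention vector (identity feature transform assumed).
    """
    dim = len(parts[0])
    active = [k for k, m in enumerate(mass) if m > min_mass]
    if not active:
        return [0.0] * dim
    updated = []
    for i in active:
        # Unnormalised attention score for every edge j -> i.
        scores = [leaky_relu(dot(attn_dst, parts[i]) + dot(attn_src, parts[j]))
                  for j in active]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        z = sum(weights)
        updated.append([
            sum(w / z * parts[j][d] for w, j in zip(weights, active))
            for d in range(dim)
        ])
    return [sum(u[d] for u in updated) / len(updated) for d in range(dim)]
```

With zero attention vectors the weights are uniform, so the readout reduces to the plain mean of the active part descriptors.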

Significance. If the empirical claims hold under the stated protocol, the work offers a parameter-efficient, explicitly part-aware approach to CVGL that marginalizes altitude variation and improves weather robustness. The fixed single-pass evaluation protocol and dedicated weather benchmark are strengths that could aid reproducibility and practical deployment in GNSS-denied drone navigation.

major comments (2)
  1. [Abstract] Abstract, component (i): the single-pass cosine assignment of patch tokens to learnable prototypes is asserted to discover semantic parts that separate layout from texture across the view gap, yet no invariance to scale, illumination, or viewpoint is built into the cosine operation on raw ViT tokens; without additional constraints, visualizations, or ablations demonstrating consistent layout isolation (rather than texture or weather clustering), this assumption is load-bearing for both the reported accuracy gains and the widened weather-robustness margin.
  2. [Abstract] Abstract and method description: the manuscript states clear SOTA numbers and a widening weather gap but provides no quantitative ablation studies, error bars, or full training details (including prototype count, loss weighting schedules, and hyperparameter sensitivity); this absence prevents confirmation that the gains are robust rather than sensitive to unstated benchmark choices or post-hoc protocol decisions.
minor comments (1)
  1. [Abstract] The abstract refers to 'theory-grounded components' and 'Pareto-stationary' points for the Kendall-weighted loss, but the main text should explicitly link each component to its theoretical grounding with a short derivation or reference.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and for recognizing the potential of SkyPart as a parameter-efficient approach to cross-view geo-localization with improved weather robustness. We address each major comment below, providing clarifications on the design rationale and committing to revisions that supply the requested evidence and details.

Point-by-point responses
  1. Referee: [Abstract] Abstract, component (i): the single-pass cosine assignment of patch tokens to learnable prototypes is asserted to discover semantic parts that separate layout from texture across the view gap, yet no invariance to scale, illumination, or viewpoint is built into the cosine operation on raw ViT tokens; without additional constraints, visualizations, or ablations demonstrating consistent layout isolation (rather than texture or weather clustering), this assumption is load-bearing for both the reported accuracy gains and the widened weather-robustness margin.

    Authors: We agree that cosine similarity on raw ViT tokens lacks explicit invariance mechanisms. The part discovery emerges from end-to-end optimization: the prototypes compete to explain patch tokens under the joint CVGL objective, altitude modulation, and Kendall-weighted loss, which penalizes reliance on transient texture or weather cues. To directly address the concern, we will add visualizations of prototype-to-patch assignments on paired drone-satellite images under varying weather and viewpoints, plus targeted ablations that replace the prototype grouping with standard pooling or attention while keeping all other components fixed. These will quantify whether the groupings isolate layout semantics rather than texture or corruption patterns.

    Revision: yes

  2. Referee: [Abstract] Abstract and method description: the manuscript states clear SOTA numbers and a widening weather gap but provides no quantitative ablation studies, error bars, or full training details (including prototype count, loss weighting schedules, and hyperparameter sensitivity); this absence prevents confirmation that the gains are robust rather than sensitive to unstated benchmark choices or post-hoc protocol decisions.

    Authors: We acknowledge that the current version does not contain sufficient quantitative ablations, error bars, or exhaustive training specifications to fully substantiate robustness. We will expand the experiments section with (i) component-wise ablation tables reporting mean and standard deviation over multiple random seeds, (ii) the precise prototype count and initialization, (iii) the Kendall uncertainty weighting schedule and its evolution during training, and (iv) a hyperparameter sensitivity study on prototype count and loss coefficients. These additions will be presented in new tables and text to enable independent verification of the SOTA claims and the weather-robustness margin.

    Revision: yes
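The seed-level reporting promised in item (i) is simple to compute. A small sketch using Python's standard library is shown below; `summarize_runs` is a hypothetical helper name, and any scores passed to it in examples are placeholders, not results from the paper.

```python
import statistics

def summarize_runs(scores):
    """Return (mean, sample standard deviation) of per-seed scores.

    With a single run the spread is undefined, so 0.0 is reported.
    """
    mean = statistics.mean(scores)
    std = statistics.stdev(scores) if len(scores) > 1 else 0.0
    return mean, std
```

Reporting the sample standard deviation (n − 1 in the denominator) is the usual choice when only a handful of seeds are available.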

Circularity Check

0 steps flagged

No circularity: components empirically validated on external benchmarks

The paper introduces SkyPart as a swappable head on standard ViT backbones, with components (learnable prototypes via cosine assignment, altitude-conditioned modulation, graph-attention readout, Kendall-weighted loss) whose effectiveness is measured directly on public external datasets (SUES-200, University-1652, DenseUAV) under a fixed single-pass protocol. No equations, predictions, or uniqueness claims reduce the reported SOTA metrics or weather-robustness gains to quantities defined by the paper's own fitted parameters or self-citations. The derivation chain remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 2 axioms · 0 invented entities

The central claim rests on standard deep-learning assumptions plus two domain-specific premises that are not independently verified in the abstract: that cosine-based prototype assignment discovers semantically meaningful parts separating layout from texture, and that altitude modulation can be isolated to training without harming inference embeddings. No new physical entities are postulated.

free parameters (1)
  • number of prototypes
    The count of competing learnable prototypes is a hyperparameter whose value is not stated in the abstract.
axioms (2)
  • domain assumption: Single-pass cosine assignment to learnable prototypes discovers semantic parts that separate layout from texture across view gaps
    Invoked in the first component of SkyPart.
  • domain assumption: Altitude-conditioned linear modulation applied only in training yields altitude-free embeddings at inference
    Invoked in the second component.

pith-pipeline@v0.9.0 · 5601 in / 1593 out tokens · 61646 ms · 2026-05-13T01:26:12.072935+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?

  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 6 internal anchors

  1. [1]

    Emergence of invariance and disentanglement in deep representations

    Alessandro Achille and Stefano Soatto. Emergence of invariance and disentanglement in deep representations. Journal of Machine Learning Research, 19(50):1–34, 2018. URL https://www.jmlr.org/papers/v19/17-646.html

  2. [2]

    NetVLAD: CNN architecture for weakly supervised place recognition

    Relja Arandjelović, Petr Gronat, Akihiko Torii, Tomas Pajdla, and Josef Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016. URL https://openaccess.thecvf.com/content_cvpr_2016/html/Arandjelovic_NetVLAD_CNN_Architecture_CVPR_2016_paper.html

  3. [3]

    data2vec: A general framework for self-supervised learning in speech, vision and language

    Alexei Baevski, Wei-Ning Hsu, Qiantong Xu, Arun Babu, Jiatao Gu, and Michael Auli. data2vec: A general framework for self-supervised learning in speech, vision and language. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. doi:10.48550/arXiv.2202.03555

  4. [4]

    Recognition-by-components: A theory of human image understanding

    Irving Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115–147, 1987. doi:10.1037/0033-295X.94.2.115

  5. [5]

    Emerging properties in self-supervised vision transformers

    Mathilde Caron, Hugo Touvron, Ishan Misra, Hervé Jégou, Julien Mairal, Piotr Bojanowski, and Armand Joulin. Emerging properties in self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 9650–9660, 2021. doi:10.1109/ICCV48922.2021.00951. URL https://openaccess.thecvf.com/content...

  6. [6]

    SDPL: Shifting-dense partition learning for UAV-view geo-localization

    Quan Chen, Tingyu Wang, Zihao Yang, Haoran Li, Rongfeng Lu, Yaoqi Sun, Bolun Zheng, and Chenggang Yan. SDPL: Shifting-dense partition learning for UAV-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11810–11824, 2024. doi:10.1109/TCSVT.2024.3424196

  7. [7]

    A simple framework for contrastive learning of visual representations

    Ting Chen, Simon Kornblith, Mohammad Norouzi, and Geoffrey Hinton. A simple framework for contrastive learning of visual representations. In International Conference on Machine Learning (ICML), 2020. URL https://proceedings.mlr.press/v119/chen20j.html

  8. [8]

    An empirical study of training self-supervised vision transformers

    Xinlei Chen, Saining Xie, and Kaiming He. An empirical study of training self-supervised vision transformers. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021. URL https://openaccess.thecvf.com/content/ICCV2021/papers/Chen_An_Empirical_Study_of_Training_Self-Supervised_Vision_Transformers_ICCV_2021_paper.pdf

  9. [9]

    Multilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization

    Zhongwei Chen, Zhao-Xu Yang, and Hai-Jun Rong. Multilevel embedding and alignment network with consistency and invariance learning for cross-view geo-localization. IEEE Transactions on Geoscience and Remote Sensing, 63:1–15, 2025. doi:10.1109/TGRS.2025.3572775

  10. [10]

    Group equivariant convolutional networks

    Taco Cohen and Max Welling. Group equivariant convolutional networks. In International Conference on Machine Learning (ICML), 2016. URL https://proceedings.mlr.press/v48/cohenc16.html

  11. [11]

    A review on absolute visual localization for UAV

    Andy Couturier and Moulay A. Akhloufi. A review on absolute visual localization for UAV. Robotics and Autonomous Systems, 135:103666, 2021. doi:10.1016/j.robot.2020.103666

  12. [12]

    A transformer-based feature segmentation and region alignment method for UAV-view geo-localization

    Ming Dai, Jianhong Hu, Jiedong Zhuang, and Enhui Zheng. A transformer-based feature segmentation and region alignment method for UAV-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(7):4376–4389, 2022. doi:10.1109/TCSVT.2021.3135013

  13. [13]

    Vision-based UAV self-positioning in low-altitude urban environments

    Ming Dai, Enhui Zheng, Zhenhua Feng, Lei Qi, Jiedong Zhuang, and Wankou Yang. Vision-based UAV self-positioning in low-altitude urban environments. IEEE Transactions on Image Processing, 33:493–508, 2024. doi:10.1109/TIP.2023.3346279

  14. [14]

    Sample4Geo: Hard negative sampling for cross-view geo-localisation

    Fabian Deuser, Konrad Habel, and Norbert Oswald. Sample4Geo: Hard negative sampling for cross-view geo-localisation. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 16801–16810, 2023. doi:10.1109/ICCV51070.2023.01545

  15. [15]

    An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

    Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, and Neil Houlsby. An image is worth 16x16 words: Transformers for image recognition at scale. In Proceedings of the International Conference on Learning Representatio...

  16. [16]

    CCR: A counterfactual causal reasoning-based method for cross-view geo-localization

    Haolin Du, Jingfei He, and Yuanqing Zhao. CCR: A counterfactual causal reasoning-based method for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 34(11):11630–11643, 2024. doi:10.1109/TCSVT.2024.3425509

  17. [17]

    Feature-wise transformations

    Vincent Dumoulin, Ethan Perez, Nathan Schucher, Florian Strub, Harm de Vries, Aaron Courville, and Yoshua Bengio. Feature-wise transformations. Distill, 2018. doi:10.23915/distill.00011

  18. [18]

    Multi-weather cross-view geo-localization using denoising diffusion models

    Tongtong Feng, Qing Li, Xin Wang, Mingzi Wang, Guangyao Li, and Wenwu Zhu. Multi-weather cross-view geo-localization using denoising diffusion models. In Proceedings of the 2nd Workshop on UAVs in Multimedia (UAVM), pages 35–39, 2024. doi:10.1145/3689095.3689103

  19. [19]

    Unsupervised domain adaptation by backpropagation

    Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. In International Conference on Machine Learning (ICML), 2015. doi:10.48550/arXiv.1409.7495

  20. [20]

    Semantic concept perception network with interactive prompting for cross-view image geo-localization

    Yuan Gao, Haibo Liu, and Xiaohui Wei. Semantic concept perception network with interactive prompting for cross-view image geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 35(6):5343–5354, 2025. doi:10.1109/TCSVT.2025.3533574

  21. [21]

    Multilevel feedback joint representation learning network based on adaptive area elimination for cross-view geo-localization

    Fawei Ge, Yunzhou Zhang, Li Wang, Wei Liu, Yixiu Liu, Sonya Coleman, and Dermot Kerr. Multilevel feedback joint representation learning network based on adaptive area elimination for cross-view geo-localization. IEEE Transactions on Geoscience and Remote Sensing, 62:1–15, 2024. doi:10.1109/TGRS.2024.3396330

  22. [22]

    Knowledge distillation: A survey

    Jianping Gou, Baosheng Yu, Stephen J. Maybank, and Dacheng Tao. Knowledge distillation: A survey. International Journal of Computer Vision, 129:1789–1819, 2021. doi:10.1007/s11263-021-01453-z

  23. [23]

    Masked autoencoders are scalable vision learners

    Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 16000–16009, 2022. URL https://openaccess.thecvf.com/content/CVPR2022/html/He_Masked_Autoencoders_Are_Scalable_Vision_Le...

  24. [24]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distilling the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. doi:10.48550/arXiv.1503.02531. NIPS 2015 Deep Learning Workshop

  25. [25]

    MCFA: Multi-scale cascade and feature adaptive alignment network for cross-view geo-localization

    Kaiji Hou, Qiang Tong, Na Yan, Xiulei Liu, and Shoulu Hou. MCFA: Multi-scale cascade and feature adaptive alignment network for cross-view geo-localization. Sensors, 25(14):4519, 2025. doi:10.3390/s25144519

  26. [26]

    Sixing Hu, Mengdan Feng, Rang M. H. Nguyen, and Gim Hee Lee. CVM-Net: Cross-view matching network for image-based ground-to-aerial geo-localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Hu_CVM-Net_Cross-View_Matching_CVPR_2018_paper.html

  27. [27]

    Multi-task learning using uncertainty to weigh losses for scene geometry and semantics

    Alex Kendall, Yarin Gal, and Roberto Cipolla. Multi-task learning using uncertainty to weigh losses for scene geometry and semantics. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 7482–7491, 2018. URL https://openaccess.thecvf.com/content_cvpr_2018/html/Kendall_Multi-Task_Learning_Using_CVPR_2018_paper.html

  28. [28]

    Proxy anchor loss for deep metric learning

    Sungyeon Kim, Dongwon Kim, Minsu Cho, and Suha Kwak. Proxy anchor loss for deep metric learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3238–3247, 2020. doi:10.48550/arXiv.2003.13911. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Kim_Proxy_Anchor_Loss_for_Deep_Metric_Learning_CVPR_202...

  29. [29]

    Semi-Supervised Classification with Graph Convolutional Networks

    Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017. doi:10.48550/arXiv.1609.02907

  30. [30]

    Alexander Lappe and Martin A. Giese. Register and [CLS] tokens induce a decoupling of local and global features in large ViTs. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://openreview.net/forum?id=KhavyzO9kK

  31. [31]

    GeoFormer: An effective Transformer-based Siamese network for UAV geolocalization

    Qingge Li, Xiaogang Yang, Jiwei Fan, Ruitao Lu, Bin Tang, Siyu Wang, and Shuang Su. GeoFormer: An effective Transformer-based Siamese network for UAV geolocalization. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:9470–9491, 2024. doi:10.1109/JSTARS.2024.3392812

  32. [32]

    A self-adaptive feature extraction method for aerial-view geo-localization

    Jinliang Lin, Zhiming Luo, Dazhen Lin, Shaozi Li, and Zhun Zhong. A self-adaptive feature extraction method for aerial-view geo-localization. IEEE Transactions on Image Processing, 34:126–139, 2025. doi:10.1109/TIP.2024.3513157

  33. [33]

    Learning deep representations for ground-to-aerial geolocalization

    Tsung-Yi Lin, Yin Cui, Serge Belongie, and James Hays. Learning deep representations for ground-to-aerial geolocalization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2015. URL https://openaccess.thecvf.com/content_cvpr_2015/html/Lin_Learning_Deep_Representations_for_CVPR_2015_paper.html

  34. [34]

    SeGCN: A semantic-aware graph convolutional network for UAV geo-localization

    Xiangzeng Liu, Ziyao Wang, Yue Wu, and Qiguang Miao. SeGCN: A semantic-aware graph convolutional network for UAV geo-localization. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 17:6055–6066, 2024. doi:10.1109/JSTARS.2024.3370612

  35. [35]

    Object-centric learning with slot attention

    Francesco Locatello, Dirk Weissenborn, Thomas Unterthiner, Aravindh Mahendran, Georg Heigold, Jakob Uszkoreit, Alexey Dosovitskiy, and Thomas Kipf. Object-centric learning with slot attention. In Advances in Neural Information Processing Systems (NeurIPS), 2020. doi:10.48550/arXiv.2006.15055

  36. [36]

    SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation

    Huaishao Luo, Junwei Bao, Youzheng Wu, Xiaodong He, and Tianrui Li. SegCLIP: Patch aggregation with learnable centers for open-vocabulary semantic segmentation. In Proceedings of the International Conference on Machine Learning (ICML), 2023. doi:10.48550/arXiv.2211.14813

  37. [37]

    Let all be whitened: Multi-teacher distillation for efficient visual retrieval

    Zhe Ma, Jianfeng Dong, Shouling Ji, Zhenguang Liu, Xuhong Zhang, Zonghui Wang, Sifeng He, Feng Qian, Xiaobo Zhang, and Lei Yang. Let all be whitened: Multi-teacher distillation for efficient visual retrieval. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 5143–5151, 2024. URL https://ojs.aaai.org/index.php/AAAI/article...

  38. [38]

    Representation and recognition of the spatial organization of three-dimensional shapes

    David Marr and H. Keith Nishihara. Representation and recognition of the spatial organization of three-dimensional shapes. Proceedings of the Royal Society of London. Series B, Biological Sciences, 200(1140):269–294, 1978. doi:10.1098/rspb.1978.0020

  39. [39]

    DINOv2: Learning robust visual features without supervision

    Maxime Oquab, Timothée Darcet, Théo Moutakanni, Huy Vo, Marc Szafraniec, Vasil Khalidov, Pierre Fernandez, Daniel Haziza, Francisco Massa, Alaaeldin El-Nouby, et al. DINOv2: Learning robust visual features without supervision. Transactions on Machine Learning Research, 2024. URL https://openreview.net/pdf?id=GLm1BA3C8p

  40. [40]

    Relational knowledge distillation

    Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3967–3976, 2019. URL https://openaccess.thecvf.com/content_CVPR_2019/papers/Park_Relational_Knowledge_Distillation_CVPR_2019_paper.pdf

  41. [41]

    FiLM: Visual reasoning with a general conditioning layer

    Ethan Perez, Florian Strub, Harm de Vries, Vincent Dumoulin, and Aaron Courville. FiLM: Visual reasoning with a general conditioning layer. In Proceedings of the AAAI Conference on Artificial Intelligence, 2018. doi:10.1609/aaai.v32i1.11671

  42. [42]

    DINO-MSRA: A novel network architecture for cross-view image retrieval and localization of UAV and satellite images

    Yifan Ping, Jun Lu, Haitao Guo, Qingfeng Hou, Kun Zhu, Zehao Sang, and Tong Liu. DINO-MSRA: A novel network architecture for cross-view image retrieval and localization of UAV and satellite images. Journal of Geo-information Science, 27(7): 1608--1623, 2025. doi:10.12082/dqxxkx.2025.250051

  43. [43]

    Recent advances on jamming and spoofing detection in GNSS

    Katarina Radoš, Marta Brkić, and Dinko Begušić. Recent advances on jamming and spoofing detection in GNSS. Sensors, 24(13): 4210, 2024. doi:10.3390/s24134210

  44. [44]

    FaceNet: A unified embedding for face recognition and clustering

    Florian Schroff, Dmitry Kalenichenko, and James Philbin. FaceNet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 815--823, 2015. doi:10.1109/CVPR.2015.7298682. URL https://ieeexplore.ieee.org/document/7298682

  45. [45]

    Multi-task learning as multi-objective optimization

    Ozan Sener and Vladlen Koltun. Multi-task learning as multi-objective optimization. In Advances in Neural Information Processing Systems (NeurIPS), 2018. doi:10.48550/arXiv.1810.04650

  46. [46]

    MCCG: A ConvNeXt-based multiple-classifier method for cross-view geo-localization

    Tianrui Shen, Yingmei Wei, Lai Kang, Shanshan Wan, and Yee-Hong Yang. MCCG: A ConvNeXt-based multiple-classifier method for cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 34(3): 1456--1468, 2024. doi:10.1109/TCSVT.2023.3296074

  47. [47]

    Where am I looking at? Joint location and orientation estimation by cross-view matching

    Yujiao Shi, Xin Yu, Dylan Campbell, and Hongdong Li. Where am I looking at? Joint location and orientation estimation by cross-view matching. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Shi_Where_Am_I_Looking_At_Joint_Location_and_Orientation_E...

  48. [48]

    TirSA: A three stage approach for UAV-satellite cross-view geo-localization based on self-supervised feature enhancement

    Jian Sun, Hao Sun, Lin Lei, Kefeng Ji, and Gangyao Kuang. TirSA: A three stage approach for UAV-satellite cross-view geo-localization based on self-supervised feature enhancement. IEEE Transactions on Circuits and Systems for Video Technology, 34(9): 7882--7895, 2024. doi:10.1109/TCSVT.2024.3382717

  49. [49]

    Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline)

    Yifan Sun, Liang Zheng, Yi Yang, Qi Tian, and Shengjin Wang. Beyond part models: Person retrieval with refined part pooling (and a strong convolutional baseline). In Proceedings of the European Conference on Computer Vision (ECCV), 2018. URL https://openaccess.thecvf.com/content_ECCV_2018/html/Yifan_Sun_Beyond_Part_Models_ECCV_2018_paper.html

  50. [50]

    Circle loss: A unified perspective of pair similarity optimization

    Yifan Sun, Changmao Cheng, Yuhan Zhang, Chi Zhang, Liang Zheng, Zhongdao Wang, and Yichen Wei. Circle loss: A unified perspective of pair similarity optimization. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2020. doi:10.1109/CVPR42600.2020.00643. URL https://openaccess.thecvf.com/content_CVPR_2020/html/Sun_...

  51. [51]

    Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results

    Antti Tarvainen and Harri Valpola. Mean teachers are better role models: Weight-averaged consistency targets improve semi-supervised deep learning results. In Advances in Neural Information Processing Systems (NeurIPS), 2017. URL https://proceedings.neurips.cc/paper/2017/hash/5a61e2356a4a14f2a8c4e1a4c4c7e26a-Abstract.html

  52. [52]

    Contrastive representation distillation

    Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations (ICLR), 2020. URL https://openreview.net/pdf?id=SkgpBJrtvS

  53. [53]

    Representation learning with contrastive predictive coding

    Aäron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. In Advances in Neural Information Processing Systems (NeurIPS), 2018. doi:10.48550/arXiv.1807.03748

  54. [54]

    Attention is all you need

    Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems (NeurIPS), 2017. URL https://proceedings.neurips.cc/paper/2017/hash/3f5ee243547dee91fbd053c1c4a845aa-Abstract.html

  55. [55]

    Graph attention networks

    Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In Proceedings of the International Conference on Learning Representations (ICLR), 2018. doi:10.48550/arXiv.1710.10903. URL https://openreview.net/forum?id=rJXMpikCZ

  56. [56]

    Tent: Fully test-time adaptation by entropy minimization

    Dequan Wang, Evan Shelhamer, Shaoteng Liu, Bruno Olshausen, and Trevor Darrell. Tent: Fully test-time adaptation by entropy minimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2021. doi:10.48550/arXiv.2006.10726

  57. [57]

    Each part matters: Local patterns facilitate cross-view geo-localization

    Tingyu Wang, Zhedong Zheng, Chenggang Yan, Jiyong Zhang, Yaoqi Sun, Bolun Zheng, and Yi Yang. Each part matters: Local patterns facilitate cross-view geo-localization. IEEE Transactions on Circuits and Systems for Video Technology, 32(2): 867--879, 2022. doi:10.1109/TCSVT.2021.3061265

  58. [58]

    Multiple-environment self-adaptive network for aerial-view geo-localization

    Tingyu Wang, Zhedong Zheng, Yaoqi Sun, Chenggang Yan, Yi Yang, and Tat-Seng Chua. Multiple-environment self-adaptive network for aerial-view geo-localization. Pattern Recognition, 152: 110363, 2024. doi:10.1016/j.patcog.2024.110363

  59. [59]

    Understanding contrastive representation learning through alignment and uniformity on the hypersphere

    Tongzhou Wang and Phillip Isola. Understanding contrastive representation learning through alignment and uniformity on the hypersphere. In International Conference on Machine Learning (ICML), 2020. doi:10.48550/arXiv.2005.10242

  60. [60]

    Weatherprompt: Multi-modality representation learning for all-weather drone visual geo-localization

    Jiahao Wen, Hang Yu, and Zhedong Zheng. Weatherprompt: Multi-modality representation learning for all-weather drone visual geo-localization. In Advances in Neural Information Processing Systems (NeurIPS), 2025. URL https://nips.cc/virtual/2025/poster/118002

  61. [61]

    Wide-area image geolocalization with aerial reference imagery

    Scott Workman, Richard Souvenir, and Nathan Jacobs. Wide-area image geolocalization with aerial reference imagery. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), 2015. URL https://openaccess.thecvf.com/content_iccv_2015/html/Workman_Wide-Area_Image_Geolocalization_ICCV_2015_paper.html

  62. [62]

    CAMP: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning

    Qiong Wu, Yi Wan, Zhi Zheng, Yongjun Zhang, Guangshuai Wang, and Zhenyang Zhao. CAMP: A cross-view geo-localization method using contrastive attributes mining and position-aware partitioning. IEEE Transactions on Geoscience and Remote Sensing, 62: 1--14, 2024. doi:10.1109/TGRS.2024.3448499

  63. [63]

    Enhancing cross-view geo-localization with domain alignment and scene consistency

    Panwang Xia, Yi Wan, Zhi Zheng, Yongjun Zhang, and Jiwei Deng. Enhancing cross-view geo-localization with domain alignment and scene consistency. IEEE Transactions on Circuits and Systems for Video Technology, 34(12): 13271--13281, 2024. doi:10.1109/TCSVT.2024.3443510

  64. [64]

    Enhancing cross view geo localization through global local quadrant interaction network

    Jin Xu, Junping Yin, Juan Zhang, and Tianyan Gao. Enhancing cross view geo localization through global local quadrant interaction network. Scientific Reports, 15: 33431, 2025a. doi:10.1038/s41598-025-18935-6

  65. [65]

    Precise GPS-denied UAV self-positioning via context-enhanced cross-view geo-localization

    Yuanze Xu, Ming Dai, Wenxiao Cai, and Wankou Yang. Precise GPS-denied UAV self-positioning via context-enhanced cross-view geo-localization. In Chinese Conference on Pattern Recognition and Computer Vision (PRCV), pages 374--388. Springer, 2025b. doi:10.1007/978-981-95-5628-1_26

  66. [66]

    DINOv2-based UAV visual self-localization in low-altitude urban environments

    Jiaqiang Yang, Danyang Qin, Huapeng Tang, Sili Tao, Haoze Bie, and Lin Ma. DINOv2-based UAV visual self-localization in low-altitude urban environments. IEEE Robotics and Automation Letters, 10(2): 2080--2087, 2025. doi:10.1109/LRA.2025.3527762

  67. [67]

    Exploring the best way for UAV visual localization under low-altitude multi-view observation condition: A benchmark

    Yibin Ye, Xichao Teng, Shuo Chen, Zhang Li, Leqi Liu, Qifeng Yu, and Tao Tan. Exploring the best way for UAV visual localization under low-altitude multi-view observation condition: A benchmark. In Findings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2026. doi:10.48550/arXiv.2503.10692

  68. [68]

    University-1652: A multi-view multi-source benchmark for drone-based geo-localization

    Zhedong Zheng, Yunchao Wei, and Yi Yang. University-1652: A multi-view multi-source benchmark for drone-based geo-localization. In Proceedings of the 28th ACM International Conference on Multimedia (ACM MM), pages 1395--1403, 2020. doi:10.1145/3394171.3413896

  69. [69]

    iBOT: Image BERT pre-training with online tokenizer

    Jinghao Zhou, Chen Wei, Huiyu Wang, Wei Shen, Cihang Xie, Alan Yuille, and Tao Kong. iBOT: Image BERT pre-training with online tokenizer. In Proceedings of the International Conference on Learning Representations (ICLR), 2022. URL https://openreview.net/pdf?id=ydopy-e6Dg

  70. [70]

    SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite

    Runzhe Zhu, Ling Yin, Mingze Yang, Fei Wu, Yuncheng Yang, and Wenbo Hu. SUES-200: A multi-height multi-scene cross-view image benchmark across drone and satellite. IEEE Transactions on Circuits and Systems for Video Technology, 33(9): 4825--4839, 2023. doi:10.1109/TCSVT.2023.3249204