pith. machine review for the scientific record.

arxiv: 2604.09445 · v1 · submitted 2026-04-10 · 💻 cs.CV

Recognition: 2 theorem links

· Lean Theorem

AsymLoc: Towards Asymmetric Feature Matching for Efficient Visual Localization

Authors on Pith: no claims yet

Pith reviewed 2026-05-10 18:04 UTC · model grok-4.3

classification 💻 cs.CV
keywords visual localization · asymmetric feature matching · knowledge distillation · feature alignment · nearest neighbor matching · edge devices · efficient models

The pith

A small student model reaches up to 95% of a large teacher's accuracy in visual localization by aligning features for simple nearest-neighbor matching.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an asymmetric visual localization system in which a large teacher model processes database images offline while a lightweight student model processes the query image online. To overcome the mismatch in features from the two models, the authors introduce a distillation framework that combines a geometry-driven matching objective with joint detector-descriptor distillation. This alignment makes it possible to use fast, parameter-free nearest-neighbor matching instead of complex learned components. If the approach holds, it delivers near-teacher performance on standard benchmarks with models an order of magnitude smaller, which would enable precise real-time localization on battery- and heat-constrained edge devices such as smart glasses.

Core claim

AsymLoc is a distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Experiments on HPatches, ScanNet, IMC2022, and Aachen show that it achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines.

What carries the argument

Asymmetric distillation framework that uses a geometry-driven matching objective together with joint detector-descriptor distillation to align teacher and student features for nearest-neighbor matching.
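The matcher this design depends on can be made concrete. Below is a minimal sketch, not the authors' code, of parameter-free mutual nearest-neighbor matching between student descriptors from the query image and teacher descriptors from a database image; the cosine-similarity metric and pre-normalized descriptors are assumptions.

```python
import numpy as np

def mutual_nn_match(query_desc, db_desc):
    """Parameter-free mutual nearest-neighbor matching.

    query_desc: (N, D) L2-normalized student descriptors (query image)
    db_desc:    (M, D) L2-normalized teacher descriptors (database image)
    Returns a (K, 2) array of (query_idx, db_idx) pairs that are each
    other's nearest neighbor under cosine similarity.
    """
    sim = query_desc @ db_desc.T           # cosine similarity, (N, M)
    nn_q2d = sim.argmax(axis=1)            # best db match per query keypoint
    nn_d2q = sim.argmax(axis=0)            # best query match per db keypoint
    q_idx = np.arange(sim.shape[0])
    mutual = nn_d2q[nn_q2d] == q_idx       # cycle-consistency check
    return np.stack([q_idx[mutual], nn_q2d[mutual]], axis=1)
```

Because the matcher has no learned weights, all capacity for bridging the teacher–student gap must live in the distilled descriptors themselves, which is exactly what the two alignment objectives target.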

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same alignment objectives could be applied to other matching-heavy tasks such as image retrieval or stereo reconstruction to reduce reliance on learned matchers.
  • Testing the framework when the student architecture differs more radically from the teacher would reveal how robust the geometry-driven objectives are to architectural gaps.
  • Combining this distillation with model quantization or pruning could produce even smaller students while preserving the reported accuracy levels.

Load-bearing premise

The geometry-driven matching objective combined with joint detector-descriptor distillation can align features from the teacher and student models well enough to support accurate parameter-less nearest-neighbor matching.

What would settle it

Evaluating the distilled student on a new dataset with a large domain shift would settle it: if localization accuracy there falls below 70% of the teacher's, the alignment is not sufficient.

Figures

Figures reproduced from arXiv: 2604.09445 by Eric Foxlin, Gabriele Berton, Mohammad Omama, Yelin Kim.

Figure 1
Figure 1. AsymLoc bridges the gap between powerful database models and lightweight on-device localization. By explicitly modeling teacher–student asymmetry, AsymLoc enables compact query models to perform real-time localization on edge platforms such as smart glasses, drones, and single-board computers, while larger teacher models process the pre-mapped database images offline. This design delivers near-teacher acc… view at source ↗
Figure 2
Figure 2. AsymLoc Training Pipeline. Given a pair of images (A, B) with known homography, the teacher model T processes image A, while image B is processed by both the teacher T and the student S. Each network produces N keypoints with corresponding detector confidence and descriptors. The teacher outputs from A and the student outputs from B are combined to form the Mutual Matching Matrix (Sec. 3.2), which is used … view at source ↗
Figure 3
Figure 3. Examples from the evaluation datasets, spanning planar … view at source ↗
Figure 4
Figure 4. AsymLoc student–teacher asymmetric matching visualization. Symmetric student–student matching fails, whereas asymmetric student–teacher matching succeeds and closely reproduces the teacher–teacher correspondences. … view at source ↗
Figure 5
Figure 5. Efficiency–accuracy trade-offs for AsymLoc. (A) Homography estimation accuracy (HE Acc) vs. GFLOPs on HPatches. (B) HE Acc per GFLOP vs. parameter count. (C) Mean localization accuracy (MLA) vs. GFLOPs on IMC2022. (D) MLA per GFLOP vs. parameter count. Across all datasets, asymmetric training yields flatter Pareto curves and higher parameter efficiency, demonstrating superior scalability of AsymLoc compare… view at source ↗
Figure 6
Figure 6. Homography estimation accuracy on HPatches with a … view at source ↗
Figure 7
Figure 7. FPS Comparison on HPatches with SILK Teacher … view at source ↗
read the original abstract

Precise and real-time visual localization is critical for applications like AR/VR and robotics, especially on resource-constrained edge devices such as smart glasses, where battery life and heat dissipation can be primary concerns. While many efficient models exist, further reducing compute without sacrificing accuracy is essential for practical deployment. To address this, we propose asymmetric visual localization: a large Teacher model processes pre-mapped database images offline, while a lightweight Student model processes the query image online. This creates a challenge in matching features from two different models without resorting to heavy, learned matchers. We introduce AsymLoc, a novel distillation framework that aligns a Student to its Teacher through a combination of a geometry-driven matching objective and a joint detector-descriptor distillation objective, enabling fast, parameter-less nearest-neighbor matching. Extensive experiments on HPatches, ScanNet, IMC2022, and Aachen show that AsymLoc achieves up to 95% of the teacher's localization accuracy using an order of magnitude smaller models, significantly outperforming existing baselines and establishing a new state-of-the-art efficiency-accuracy trade-off.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper introduces AsymLoc, a distillation framework for asymmetric visual localization in which a large teacher model processes database images offline and a lightweight student model processes query images online. Alignment is achieved via a geometry-driven matching objective combined with joint detector-descriptor distillation, enabling parameter-less nearest-neighbor matching between the two models. Experiments across HPatches, ScanNet, IMC2022, and Aachen report that the student reaches up to 95% of teacher localization accuracy at roughly 10x smaller model size while outperforming prior baselines and establishing a new efficiency-accuracy trade-off.

Significance. If the empirical results hold, the work would be significant for practical deployment of visual localization on edge devices such as AR/VR headsets and mobile robots, where compute, power, and heat constraints are severe. The asymmetric teacher-student design with direct NN matching avoids heavy learned matchers and is supported by evaluation on four diverse datasets (HPatches, ScanNet, IMC2022, Aachen), which provides reasonably broad empirical grounding for the efficiency claims.

major comments (2)
  1. [§3.2] §3.2 (Distillation Objectives): the geometry-driven matching loss is asserted to produce descriptor spaces that support direct NN matching without learned components, yet the manuscript does not provide an explicit analysis or ablation showing that the student-teacher feature distributions are sufficiently aligned (e.g., no cosine-similarity histograms or nearest-neighbor recall curves between teacher and student descriptors). This is load-bearing for the central claim that parameter-less matching suffices.
  2. [§4.3] §4.3, Table 3 (Aachen day-night results): the reported 95% relative accuracy figure is given without per-sequence breakdowns, standard deviations across runs, or comparison against the teacher under identical RANSAC settings; without these, it is impossible to judge whether the efficiency gain is robust or dataset-specific.
minor comments (3)
  1. [§3.2] Notation for the joint detector-descriptor loss is introduced without a compact equation reference; adding a single boxed equation summarizing L_det + L_desc would improve readability.
  2. [§4.1] The abstract and introduction repeatedly use “order of magnitude smaller” without stating the exact parameter counts or FLOPs of the teacher and student backbones; a small table in §4.1 would clarify this.
  3. Qualitative figures showing successful and failure-case matches between teacher and student descriptors would help readers understand the limits of the asymmetric alignment.
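The alignment evidence the report asks for in major comment 1 is cheap to compute once corresponding teacher and student descriptors are in hand. A hedged sketch, assuming row i of each matrix describes the same keypoint (not taken from the paper):

```python
import numpy as np

def cross_model_nn_recall(student_desc, teacher_desc):
    """Fraction of student descriptors whose nearest-neighbor teacher
    descriptor belongs to the corresponding keypoint.

    student_desc, teacher_desc: (N, D) arrays; row i of each is assumed to
    describe the same keypoint. Rows are L2-normalized so the dot product
    is cosine similarity. Returns (recall, per-pair cosine similarities).
    """
    s = student_desc / np.linalg.norm(student_desc, axis=1, keepdims=True)
    t = teacher_desc / np.linalg.norm(teacher_desc, axis=1, keepdims=True)
    sim = s @ t.T                                   # (N, N) cross-model cosine
    hits = sim.argmax(axis=1) == np.arange(len(s))  # NN lands on the diagonal
    return float(hits.mean()), np.diag(sim)
```

Plotting the returned diagonal similarities against the off-diagonal entries of `sim` yields exactly the cosine-similarity histogram the referee requests; sweeping a threshold over `sim` gives the nearest-neighbor recall curves.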

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive evaluation of the work's significance for edge-device visual localization. We address each major comment below with clarifications and commitments to strengthen the manuscript.

read point-by-point responses
  1. Referee: [§3.2] §3.2 (Distillation Objectives): the geometry-driven matching loss is asserted to produce descriptor spaces that support direct NN matching without learned components, yet the manuscript does not provide an explicit analysis or ablation showing that the student-teacher feature distributions are sufficiently aligned (e.g., no cosine-similarity histograms or nearest-neighbor recall curves between teacher and student descriptors). This is load-bearing for the central claim that parameter-less matching suffices.

    Authors: We agree that direct evidence of descriptor alignment would strengthen the central claim. While the high localization accuracy achieved via parameter-less NN matching across four diverse benchmarks (HPatches, ScanNet, IMC2022, Aachen) provides strong indirect validation, we will add explicit analysis in the revision. Specifically, we will include cosine-similarity histograms for teacher-teacher, student-student, and cross teacher-student descriptor pairs on a held-out set, along with nearest-neighbor recall curves at varying distance thresholds. These will be placed in §3.2 or a new supplementary section to demonstrate the alignment induced by the geometry-driven matching objective. revision: yes

  2. Referee: [§4.3] §4.3, Table 3 (Aachen day-night results): the reported 95% relative accuracy figure is given without per-sequence breakdowns, standard deviations across runs, or comparison against the teacher under identical RANSAC settings; without these, it is impossible to judge whether the efficiency gain is robust or dataset-specific.

    Authors: We thank the referee for this suggestion to improve transparency. The 95% figure is an aggregate over the Aachen day and night sequences. In the revised Table 3 we will report separate day and night results with per-sequence breakdowns. We will also rerun the evaluation with multiple RANSAC random seeds (e.g., 5 runs) and report mean and standard deviation for both teacher and student to quantify variance. Finally, we will explicitly confirm and document that all teacher and student results use identical RANSAC hyperparameters (inlier threshold, maximum iterations, etc.) for direct comparability. revision: yes
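The variance study the authors commit to (multiple RANSAC seeds, identical hyperparameters for teacher and student) can be sketched end to end. This is a generic 4-point DLT homography inside a seeded RANSAC loop, not the paper's pipeline; the estimator, threshold, and iteration count are illustrative.

```python
import numpy as np

def dlt_homography(src, dst):
    """Direct linear transform: homography with dst ~ H @ src (4+ pairs).
    The returned H is defined up to scale, which reprojection ignores."""
    rows = []
    for (x, y), (u, v) in zip(src, dst):
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y, -u])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y, -v])
    _, _, Vt = np.linalg.svd(np.asarray(rows, dtype=float))
    return Vt[-1].reshape(3, 3)

def ransac_inliers(src, dst, thresh=3.0, iters=200, seed=0):
    """Seeded RANSAC over 4-point samples; returns the best inlier mask."""
    rng = np.random.default_rng(seed)
    best = np.zeros(len(src), dtype=bool)
    for _ in range(iters):
        idx = rng.choice(len(src), 4, replace=False)
        H = dlt_homography(src[idx], dst[idx])
        proj = np.c_[src, np.ones(len(src))] @ H.T
        proj = proj[:, :2] / proj[:, 2:3]          # perspective division
        inl = np.linalg.norm(proj - dst, axis=1) < thresh
        if inl.sum() > best.sum():
            best = inl
    return best

def inlier_ratio_stats(src, dst, seeds=(0, 1, 2, 3, 4)):
    """Mean and std of the inlier ratio across RANSAC seeds."""
    ratios = [ransac_inliers(src, dst, seed=s).mean() for s in seeds]
    return float(np.mean(ratios)), float(np.std(ratios))
```

Running `inlier_ratio_stats` on both teacher and student correspondences with the same threshold, iteration count, and seed list gives the like-for-like, variance-quantified comparison the rebuttal promises.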

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper introduces AsymLoc as a novel distillation framework that combines a geometry-driven matching objective with joint detector-descriptor distillation to enable parameter-less nearest-neighbor matching between a large teacher model (database) and lightweight student model (query). The central claims of achieving up to 95% of teacher localization accuracy with an order of magnitude smaller models are supported directly by empirical results on HPatches, ScanNet, IMC2022, and Aachen rather than by any reduction to self-definitions, fitted parameters renamed as predictions, or load-bearing self-citations. No equations or prior-work ansatzes are shown to collapse the derivation chain into its inputs by construction; the method is presented as a self-contained proposal with independent experimental validation of the efficiency-accuracy trade-off.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The abstract does not specify any free parameters, axioms, or invented entities; the method relies on standard distillation techniques adapted to geometry-driven objectives.

pith-pipeline@v0.9.0 · 5494 in / 1196 out tokens · 57076 ms · 2026-05-10T18:04:16.429378+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

73 extracted references · 3 canonical work pages · 3 internal anchors

  1. [1]

    Variational information dis- tillation for knowledge transfer

    Sungsoo Ahn, Shell Xu Hu, Andreas Damianou, Neil Lawrence, and Zhenwen Dai. Variational information dis- tillation for knowledge transfer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 9163–9171, 2019. 3

  2. [2]

    Arandjelovi’c, P

    R. Arandjelovi’c, P. Gronat, A. Torii, T. Pajdla, and J. Sivic. NetVLAD: CNN architecture for weakly supervised place recognition. InIEEE Conference on Computer Vision and Patter Recognition (CVPR), 2016. 2

  3. [3]

    Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors

    Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krys- tian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. InProceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5173–5182, 2017. 5

  4. [4]

    Megaloc: One retrieval to place them all

    Gabriele Berton and Carlo Masone. Megaloc: One retrieval to place them all. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) Workshops, pages 2861–2867, 2025. 2

  5. [5]

    Re- thinking visual geo-localization for large-scale applications

    Gabriele Berton, Carlo Masone, and Barbara Caputo. Re- thinking visual geo-localization for large-scale applications. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022. 2

  6. [6]

    Earthmatch: Iterative coregistration for fine-grained localization of astro- naut photography

    Gabriele Berton, Gabriele Goletto, Gabriele Trivigno, Alex Stoken, Barbara Caputo, and Carlo Masone. Earthmatch: Iterative coregistration for fine-grained localization of astro- naut photography. InProceedings of the IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR) Workshops, 2024. 2

  7. [7]

    Crocodl: Cross-device collaborative dataset for local- ization

    Hermann Blum, Alessandro Mercurio, Joshua O’Reilly, Tim Engelbracht, Mihai Dusmanu, Marc Pollefeys, and Zuria Bauer. Crocodl: Cross-device collaborative dataset for local- ization. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 27424–27434, 2025. 1

  8. [8]

    A case for using rotation invariant features in state of the art feature matchers

    Georg B ¨okman and Fredrik Kahl. A case for using rotation invariant features in state of the art feature matchers. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 5110–5119, 2022. 3

  9. [9]

    Asymmetric met- ric learning for knowledge transfer

    Mateusz Budnik and Yannis Avrithis. Asymmetric met- ric learning for knowledge transfer. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 8228–8238, 2021. 2

  10. [10]

    Asymmetric metric learning for knowledge transfer

    Mateusz Budnik and Yannis Avrithis. Asymmetric metric learning for knowledge transfer. InCVPR, 2021. 3, 6, 7

  11. [11]

    Rdd: Robust feature detector and descriptor using deformable transformer

    Gonglin Chen, Tianwen Fu, Haiwei Chen, Wenbin Teng, Hanyuan Xiao, and Yajie Zhao. Rdd: Robust feature detector and descriptor using deformable transformer. InProceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pages 6394–6403, 2025. 3

  12. [12]

    Learning to Match Features with Seeded Graph Matching Network

    Hongkai Chen, Zixin Luo, Jiahui Zhang, Lei Zhou, Xuyang Bai, Zeyu Hu, Chiew-Lan Tai, and Long Quan. Learning to Match Features with Seeded Graph Matching Network. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), pages 1550–1559, 2021. 2

  13. [13]

    Aspanformer: Detector-free image matching with adaptive span transformer

    Hongkai Chen, Zixin Luo, Lei Zhou, Yurun Tian, Ming- min Zhen, Tian Fang, David Mckinnon, Yanghai Tsin, and Long Quan. Aspanformer: Detector-free image matching with adaptive span transformer. InEuropean conference on computer vision, pages 20–36. Springer, 2022. 3

  14. [14]

    Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner

    Angela Dai, Angel X. Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 5828–5839, 2017. 5

  15. [15]

    Superpoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detection and description. InCVPR Workshops, 2018. 2, 6

  16. [16]

    Compatibility- aware heterogeneous visual search

    Rahul Duggal, Hao Zhou, Shuo Yang, Yuanjun Xiong, Wei Xia, Zhuowen Tu, and Stefano Soatto. Compatibility- aware heterogeneous visual search. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 10723–10732, 2021. 2

  17. [17]

    Compatibility-aware heterogeneous visual search

    Shivansh Duggal, Xiaojun Wu, and Saurabh Mittal. Compatibility-aware heterogeneous visual search. InCVPR,

  18. [18]

    D2- net: A trainable cnn for joint description and detection of local features

    Mihai Dusmanu, Ignacio Rocco, Tomas Pajdla, Marc Polle- feys, Josef Sivic, Akihiko Torii, and Torsten Sattler. D2- net: A trainable cnn for joint description and detection of local features. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 8092–8101, 2019. 2

  19. [19]

    Roma: Robust dense fea- ture matching

    Johan Edstedt, Qiyu Sun, Georg B ¨okman, M ˚arten Wadenb¨ack, and Michael Felsberg. Roma: Robust dense fea- ture matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19790– 19800, 2024. 3

  20. [20]

    Silk: Simple learned keypoints

    Pierre Gleize, Weiyao Wang, and Matt Feiszli. Silk: Simple learned keypoints. InProceedings of the IEEE/CVF interna- tional conference on computer vision, pages 22499–22508,

  21. [21]

    Distilling the Knowledge in a Neural Network

    Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network.arXiv preprint arXiv:1503.02531, 2015. 3

  22. [22]

    MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

    Andrew G. Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco An- dreetto, and Hartwig Adam. Mobilenets: Efficient convolu- tional neural networks for mobile vision applications.arXiv preprint arXiv:1704.04861, 2017. 1

  23. [23]

    Towards visual feature translation

    Jie Hu, Rongrong Ji, Hong Liu, Shengchuan Zhang, Cheng Deng, and Qi Tian. Towards visual feature translation. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 3004–3013, 2019. 2

  24. [24]

    Towards visual feature translation

    Jie Hu, Rongrong Ji, Hong Liu, Shengchuan Zhang, Cheng Deng, and Qi Tian. Towards visual feature translation. In CVPR, 2019. 3

  25. [25]

    Optimal transport ag- gregation for visual place recognition

    Sergio Izquierdo and Javier Civera. Optimal transport ag- gregation for visual place recognition. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2024. 2

  26. [26]

    OmniGlue: Generalizable Feature Matching with Foundation Model Guidance

    Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, and Andr ´e Araujo. OmniGlue: Generalizable Feature Matching with Foundation Model Guidance. InProceed- 9 ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 20719–20729, 2024. 2

  27. [27]

    Edm: Equirect- angular projection-oriented dense kernelized feature match- ing

    Dongki Jung, Jaehoon Choi, Yonghan Lee, Somi Jeong, Tae- jae Lee, Dinesh Manocha, and Suyong Yeon. Edm: Equirect- angular projection-oriented dense kernelized feature match- ing. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 6337–6347, 2025. 3

  28. [28]

    Adam: A Method for Stochastic Optimization

    Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization.CoRR, abs/1412.6980, 2014. 6

  29. [29]

    Efficient loftr: Efficient local feature match- ing with transformers

    Wei Li and et al. Efficient loftr: Efficient local feature match- ing with transformers. InECCV, 2022. 3

  30. [30]

    Megadepth: Learning single- view depth prediction from internet photos

    Zhengqi Li and Noah Snavely. Megadepth: Learning single- view depth prediction from internet photos. InComputer Vision and Pattern Recognition (CVPR), 2018. 5

  31. [31]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. InEu- ropean Conference on Computer Vision (ECCV), pages 740–

  32. [32]

    Lightglue: Local feature matching at light speed

    Philipp Lindenberger, Paul-Edouard Sarlin, Marc Pollefeys, and Mihai Dusmanu. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 18448–18458, 2023. 2

  33. [33]

    Efficient global 2d- 3d matching for camera localization in a large-scale 3d map

    Liu Liu, Hongdong Li, and Yuchao Dai. Efficient global 2d- 3d matching for camera localization in a large-scale 3d map. InCVPR, 2017. 2

  34. [34]

    ContextDesc: Lo- cal Descriptor Augmentation with Cross-Modality Context

    Zixin Luo, Tianwei Shen, Lei Zhou, Jiahui Zhang, Yao Yao, Shiwei Li, Tian Fang, and Long Quan. ContextDesc: Lo- cal Descriptor Augmentation with Cross-Modality Context. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2

  35. [35]

    Working hard to know your neighbor’s mar- gins: Local descriptor learning loss

    Anastasiia Mishchuk, Dmytro Mishkin, Filip Radenovic, and Jiri Matas. Working hard to know your neighbor’s mar- gins: Local descriptor learning loss. InAdvances in Neural Information Processing Systems (NeurIPS), 2017. 2

  36. [36]

    Relational knowledge distillation

    Wonpyo Park, Dongju Kim, Yan Lu, and Minsu Cho. Relational knowledge distillation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 3967–3976, 2019. 3, 6, 7

  37. [37]

    Image matching challenge 2022: Summary and results

    (Kaggle / CVPR Workshop Participants). Image matching challenge 2022: Summary and results. InCVPR Workshop on Image Matching: Local Features & Beyond, 2022. 5

  38. [38]

    Xfeat: Accelerated fea- tures for lightweight image matching

    Guilherme Potje, Felipe Cadar, Andr ´e Araujo, Renato Mar- tins, and Erickson R Nascimento. Xfeat: Accelerated fea- tures for lightweight image matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 2682–2691, 2024. 5, 6

  39. [39]

    Minima: Modality invariant im- age matching

    Jiangwei Ren, Xingyu Jiang, Zizhuo Li, Dingkang Liang, Xin Zhou, and Xiang Bai. Minima: Modality invariant im- age matching. InProceedings of the Computer Vision and Pattern Recognition Conference, pages 23059–23068, 2025. 3

  40. [40]

    R2d2: Reliable and repeatable detector and descriptor

    Jerome Revaud, Claudio de Souza, Martin Humenberger, and Philippe Weinzaepfel. R2d2: Reliable and repeatable detector and descriptor. InNeurIPS, 2019. 2

  41. [41]

    Fit- nets: Hints for thin deep nets

    Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fit- nets: Hints for thin deep nets. InInternational Conference on Learning Representations (ICLR), 2015. 3

  42. [42]

    Mobilenetv2: In- verted residuals and linear bottlenecks

    Mark Sandler, Andrew Howard, Menglong Zhu, Andrey Zhmoginov, and Liang-Chieh Chen. Mobilenetv2: In- verted residuals and linear bottlenecks. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pages 4510–4520, 2018. 1

  43. [43]

    Siegwart, and C ´esar Cadena

    Paul-Edouard Sarlin, Fr’ed’eric Debraine, Marcin Dymczyk, Roland Y . Siegwart, and C ´esar Cadena. Leveraging deep visual descriptors for hierarchical efficient localization. In Conference on Robot Learning, 2018. 1

  44. [44]

    From coarse to fine: Robust hierarchical localization at large scale

    Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. InCVPR, 2019. 2, 5

  45. [45]

    From coarse to fine: Robust hierarchical localization at large scale

    Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. InCVPR, pages 12716–12725,

  46. [46]

    Superglue: Learning feature matching with graph neural networks

    Paul-Edouard Sarlin, Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InCVPR, 2020. 2, 5

  47. [47]

    Sch¨onberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, and Marc Pollefeys

    Paul-Edouard Sarlin, Mihai Dusmanu, Johannes L. Sch¨onberger, Pablo Speciale, Lukas Gruber, Viktor Larsson, Ondrej Miksik, and Marc Pollefeys. LaMAR: Benchmark- ing Localization and Mapping for Augmented Reality. In ECCV, 2022. 1

  48. [48]

    Efficient & effective prioritized matching for large-scale image-based localization.IEEE PAMI, 39(9):1744–1756, 2017

    Torsten Sattler, Bastian Leibe, and Leif Kobbelt. Efficient & effective prioritized matching for large-scale image-based localization.IEEE PAMI, 39(9):1744–1756, 2017. 2

  49. [49]

    Are large-scale 3d models really necessary for accurate visual lo- calization? In2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6175–6184, 2017

    Torsten Sattler, Akihiko Torii, Josef Sivic, Marc Pollefeys, Hajime Taira, Masatoshi Okutomi, and Tomas Pajdla. Are large-scale 3d models really necessary for accurate visual lo- calization? In2017 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 6175–6184, 2017. 2

  50. [50]

    Benchmarking 6dof outdoor visual localization in changing conditions

    Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Lars Hammarstrand, Erik Stenborg, Daniel Safari, Masatoshi Okutomi, Marc Pollefeys, Josef Sivic, et al. Benchmarking 6dof outdoor visual localization in changing conditions. In CVPR, 2018. 1

  51. [51]

    Benchmarking 6dof outdoor visual localization in changing conditions

    Torsten Sattler, Will Maddern, Carl Toft, Akihiko Torii, Josef Sivic, Fredrik Kahl, Masatoshi Okutomi, Marc Pollefeys, Tomas Pajdla, Lars Hammarstrand, Erik Stenborg, David Sa- fari, Tommaso Cavallari, Luigi Di Stefano, Andrea Torsello, Dmytro Mishkin, Jiri Matas, Marc Pollefeeys, and Linus Svarm. Benchmarking 6dof outdoor visual localization in changing ...

  52. [52]

    Towards backward-compatible repre- sentation learning

    Yujun Shen and et al. Towards backward-compatible repre- sentation learning. InCVPR, 2020. 3

  53. [53]

    Towards backward-compatible representation learning

    Yantao Shen, Yuanjun Xiong, Wei Xia, and Stefano Soatto. Towards backward-compatible representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vi- sion and Pattern Recognition, pages 6368–6377, 2020. 2 10

  54. [54]

    Ames: Asymmetric and memory-efficient similarity estimation for instance-level retrieval

    Pavel Suma, Giorgos Kordopatis-Zilos, Ahmet Iscen, and Giorgos Tolias. Ames: Asymmetric and memory-efficient similarity estimation for instance-level retrieval. InEuropean Conference on Computer Vision, pages 307–325. Springer,

  55. [55]

    Loftr: Detector-free local feature matching with transformers

    Jiaming Sun, Zehong Shen, Yuang Wang, Hang Bao, Xi- aowei Zhou, and Ping Luo. Loftr: Detector-free local feature matching with transformers. InCVPR, 2021. 3, 5

  56. [56]

    City-scale localization for cameras with known ver- tical direction.IEEE PAMI, 39(7):1455–1461, 2017

    Linus Sv ¨arm, Olof Enqvist, Fredrik Kahl, and Magnus Os- karsson. City-scale localization for cameras with known ver- tical direction.IEEE PAMI, 39(7):1455–1461, 2017. 2

  57. [57]

    Inloc: Indoor visual localization with dense matching and view synthesis

    Hajime Taira, Masatoshi Okutomi, Torsten Sattler, Mircea Cimpoi, Marc Pollefeys, Josef Sivic, Tomas Pajdla, and Ak- ihiko Torii. Inloc: Indoor visual localization with dense matching and view synthesis. InCVPR, 2018. 2

  58. [58]

    Mingxing Tan and Quoc V . Le. Efficientnet: Rethinking model scaling for convolutional neural networks. InPro- ceedings of the 36th International Conference on Machine Learning (ICML), pages 6105–6114, 2019. 1

  59. [59]

    L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space

    Yurun Tian, Bin Fan, and Fuchao Wu. L2-Net: Deep Learning of Discriminative Patch Descriptor in Euclidean Space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017. 2

  60. [60]

    SOSNet: Second Order Similarity Regularization for Local Descriptor Learning

    Yurun Tian, Xin Yu, Bin Fan, Fuchao Wu, Huub Heijnen, and Vassileios Balntas. SOSNet: Second Order Similarity Regularization for Local Descriptor Learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2019. 2

  61. [61]

    Contrastive representation distillation

    Yonglong Tian, Dilip Krishnan, and Phillip Isola. Contrastive representation distillation. In International Conference on Learning Representations (ICLR), 2020. 3

  62. [62]

    Semantic match consistency for long-term visual localization

    Carl Toft, Erik Stenborg, Lars Hammarstrand, Lucas Brynte, Marc Pollefeys, Torsten Sattler, and Fredrik Kahl. Semantic match consistency for long-term visual localization. In ECCV, 2018. 2

  63. [63]

    DISK: Learning local features with policy gradient

    Michal Tyszkiewicz, Pascal Fua, and Eduard Trulls. DISK: Learning local features with policy gradient. In Advances in Neural Information Processing Systems (NeurIPS), 2020. 2

  64. [64]

    Matchformer: Interleaving attention in transformers for feature matching

    Qing Wang, Jiaming Zhang, Kailun Yang, Kunyu Peng, and Rainer Stiefelhagen. Matchformer: Interleaving attention in transformers for feature matching. In Proceedings of the Asian Conference on Computer Vision, pages 2746–2762,

  65. [65]

    Contextual similarity distillation for asymmetric image retrieval

    Hui Wu, Min Wang, Wengang Zhou, Houqiang Li, and Qi Tian. Contextual similarity distillation for asymmetric image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 9489–9498,

  66. [66]

    Contextual similarity distillation for asymmetric image retrieval

    Xiaohang Wu et al. Contextual similarity distillation for asymmetric image retrieval. In CVPR, 2022. 3, 6

  67. [67]

    D3still: Decoupled differential distillation for asymmetric image retrieval

    Luchen Xie et al. D3still: Decoupled differential distillation for asymmetric image retrieval. In CVPR, 2024. 3, 6

  68. [68]

    D3still: Decoupled differential distillation for asymmetric image retrieval

    Yi Xie, Yihong Lin, Wenjie Cai, Xuemiao Xu, Huaidong Zhang, Yong Du, and Shengfeng He. D3still: Decoupled differential distillation for asymmetric image retrieval. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 17181–17190, 2024. 2, 7

  69. [69]

    LIFT: Learned Invariant Feature Transform

    Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned Invariant Feature Transform. In European Conference on Computer Vision (ECCV), 2016. 2

  70. [70]

    A gift from knowledge distillation: Fast optimization, network minimization and transfer learning

    Junho Yim, Donggyu Joo, Jihoon Bae, and Junmo Kim. A gift from knowledge distillation: Fast optimization, network minimization and transfer learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 4133–4141, 2017. 3

  71. [71]

    Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer

    Sergey Zagoruyko and Nikos Komodakis. Paying more attention to attention: Improving the performance of convolutional neural networks via attention transfer. In International Conference on Learning Representations (ICLR), 2017. 3

  72. [72]

    Scaling vision transformers

    Xiaohua Zhai, Alexander Kolesnikov, Neil Houlsby, and Lucas Beyer. Scaling vision transformers. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 12104–12113, 2022. 1

  73. [73]

    Alike: Accurate and lightweight keypoint detection and descriptor extraction

    Xiaoming Zhao, Xingming Wu, Jinyu Miao, Weihai Chen, Peter C. Y. Chen, and Zhengguo Li. Alike: Accurate and lightweight keypoint detection and descriptor extraction. IEEE Transactions on Multimedia, 2022. 2

A. Appendix

A.1. Hyperparameter Ablations

In equation 10, we defined detector-weighted similarity matrices as:

$$\bar{S}^{ST}_{ij} = \frac{w^S_i}{\tau_s} \, S^{ST}_{ij} \, \frac{w^T_j}{\tau_t}$$

...
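The detector-weighted similarity of equation 10 can be sketched numerically: each student-teacher similarity entry is rescaled by the detection scores of the two keypoints, softened by per-model temperatures. The function name, array shapes, and toy values below are illustrative assumptions, not taken from the paper's implementation.

```python
import numpy as np

def detector_weighted_similarity(S, w_s, w_t, tau_s=1.0, tau_t=1.0):
    """Compute S_bar[i, j] = (w_s[i] / tau_s) * S[i, j] * (w_t[j] / tau_t).

    S   : (N_s, N_t) similarity matrix between student and teacher descriptors
    w_s : (N_s,) student detection scores, w_t : (N_t,) teacher detection scores
    tau_s, tau_t : temperatures softening the influence of each detector
    """
    return (w_s[:, None] / tau_s) * S * (w_t[None, :] / tau_t)

# Toy example: 2 student keypoints matched against 3 teacher keypoints.
S = np.array([[1.0, 0.5, 0.0],
              [0.2, 1.0, 0.4]])
w_s = np.array([1.0, 0.5])        # student detection scores
w_t = np.array([0.8, 1.0, 0.2])   # teacher detection scores
S_bar = detector_weighted_similarity(S, w_s, w_t, tau_s=1.0, tau_t=2.0)
# S_bar[0, 0] = 1.0/1.0 * 1.0 * 0.8/2.0 = 0.4
```

Weighting the similarity by both detection scores lets confident keypoints dominate the matching objective, which is consistent with the joint detector-descriptor distillation described in the paper.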