Deploy DINO with Many-to-Many Association

Haodong Jiang; Junfeng Wu; Mingzhe Li

arxiv: 2604.23670 · v1 · submitted 2026-04-26 · 💻 cs.CV

Deploy DINO with Many-to-Many Association

Haodong Jiang , Mingzhe Li , Junfeng Wu This is my paper

Pith reviewed 2026-05-08 06:29 UTC · model grok-4.3

classification 💻 cs.CV

keywords DINOmany-to-many matchingimage matchingzero-shotcamera pose estimationHarmonic Consensus Maximizationout-of-distribution generalizationrobust estimation

0 comments

The pith

General DINO features compete with specialized matching models on out-of-distribution datasets using many-to-many association and HCM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that DINO's general visual features can be deployed for image matching without adaptation. They address the ambiguity in matching similar objects by using many-to-many associations rather than strict one-to-one. To handle robustness efficiently under this paradigm, they develop Harmonic Consensus Maximization as a faster approximation to finding maximum-cardinality matchings. This setup lets the out-of-the-box features perform competitively with specialized models on unseen domains for camera pose estimation.

Core claim

Adopting many-to-many association for DINO features to manage inherent ambiguity on semantically similar instances, and introducing Harmonic Consensus Maximization as a likelihood-based efficient robust mechanism, allows these general-purpose features to achieve performance comparable to specialized matching models on out-of-distribution datasets in downstream tasks such as camera pose estimation.

What carries the argument

Harmonic Consensus Maximization (HCM), which provides a faster and finer-grained robust estimation by interpreting the problem from a likelihood perspective instead of computing maximum-cardinality matchings for each parameter hypothesis.

Load-bearing premise

The assumption that DINO features require a many-to-many paradigm because of ambiguity on similar instances and that HCM's likelihood approximation delivers equivalent robustness for tasks like camera pose estimation.

What would settle it

Running camera pose estimation experiments on multiple out-of-distribution datasets comparing accuracy and runtime of DINO with m-to-m plus HCM against specialized matching models; superior or equal performance on accuracy with better efficiency would confirm the claim.

Figures

Figures reproduced from arXiv: 2604.23670 by Haodong Jiang, Junfeng Wu, Mingzhe Li.

**Figure 1.** Figure 1: This figure illustrates the inherent ambiguity in establishing geometric correspondence using semantic-rich DINOv3 view at source ↗

**Figure 2.** Figure 2: An inlier under the ground-truth parameter is not view at source ↗

**Figure 3.** Figure 3: A toy example for likelihood calculation. Orange view at source ↗

**Figure 4.** Figure 4: Group precision for MKNN test with K = 3, 5, 8 and an error threshold of 5 pixels. The Easy, Average, and Hard subsets feature camera perspective difference of [0◦ , 40◦ ), [40◦ , 80◦ ) and [80◦ , 120◦ ) as detailed in Section V view at source ↗

**Figure 5.** Figure 5: Sensitivity of the HCM mechanism with respect to view at source ↗

**Figure 6.** Figure 6: Sensitivity of the HCM mechanism with respect to hyper-parameters view at source ↗

**Figure 7.** Figure 7: Histograms of discretization error in 106 runs view at source ↗

read the original abstract

Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, which requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of otherwise intractable likelihood calculation,and inspires us to propose a faster and finer-grained robust mechanism, termed as Harmonic Consensus Maximization (HCM). Take camera pose estimation as an exemplifying downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when mated with m-to-m association and the HCM mechanism.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript explores the zero-shot deployment of unmodified DINO features for image matching tasks by adopting a many-to-many (m-to-m) association paradigm to address inherent ambiguities among semantically similar instances. It reinterprets the existing robust mechanism under m-to-m matching as a zeroth-order approximation to an otherwise intractable likelihood calculation and introduces Harmonic Consensus Maximization (HCM) as a faster, finer-grained robust alternative. Using camera pose estimation as the downstream task, the paper claims that general-purpose DINO features combined with m-to-m association and HCM can compete with specialized matching models on out-of-distribution datasets.

Significance. If the result holds and HCM is shown to preserve the robustness properties of maximum-cardinality matching, this would be significant for computer vision as it demonstrates that off-the-shelf generalist visual representations can handle challenging OOD geometric tasks without any adaptation or retraining. The likelihood-based reinterpretation provides a principled foundation for deriving new mechanisms, and the work highlights the potential of m-to-m paradigms for ambiguous feature matching scenarios.

major comments (1)

[Abstract and HCM derivation] Abstract and HCM proposal: The central claim that HCM is a valid faster approximation to the intractable likelihood (reinterpreting the existing max-cardinality method as zeroth-order) is load-bearing for the competitiveness result on OOD pose estimation. However, no explicit derivation, assumptions (e.g., independence or cardinality conditions on DINO descriptor graphs), or closed-form equivalence is provided to show that the harmonic consensus step rigorously approximates or preserves the robustness of maximum-cardinality matching. If the inlier graph induced by DINO features violates these implicit assumptions, HCM may not retain the necessary properties, undermining the switch from one-to-one matching.

minor comments (2)

[Abstract] The abstract states the demonstration on camera pose estimation but provides no equations, quantitative results, error bars, or dataset details, making immediate assessment of the 'compete with specialized models' claim difficult.
[Method introduction] The introduction of 'Harmonic Consensus Maximization (HCM)' would benefit from an early, self-contained definition of the 'harmonic' component and how it differs operationally from standard consensus maximization before the likelihood reinterpretation is presented.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for the constructive feedback and for identifying the need for greater rigor in the theoretical justification of HCM. We address the major comment below and commit to a revision that strengthens the manuscript without altering its core claims or experimental results.

read point-by-point responses

Referee: [Abstract and HCM derivation] Abstract and HCM proposal: The central claim that HCM is a valid faster approximation to the intractable likelihood (reinterpreting the existing max-cardinality method as zeroth-order) is load-bearing for the competitiveness result on OOD pose estimation. However, no explicit derivation, assumptions (e.g., independence or cardinality conditions on DINO descriptor graphs), or closed-form equivalence is provided to show that the harmonic consensus step rigorously approximates or preserves the robustness of maximum-cardinality matching. If the inlier graph induced by DINO features violates these implicit assumptions, HCM may not retain the necessary properties, undermining the switch from one-to-one matching.

Authors: We agree that the current presentation would benefit from an explicit derivation and statement of assumptions. Section 3 of the manuscript introduces the likelihood view by modeling associations as an inlier graph and treats maximum-cardinality matching as selecting the largest consistent set (a zeroth-order count-based approximation to the mode of the joint likelihood). HCM is motivated as replacing the cardinality objective with a harmonic-mean consensus score over pairwise consistencies, which is computationally lighter and incorporates descriptor similarity magnitudes. However, we acknowledge that the text does not list the required assumptions (e.g., conditional independence of edge weights given the pose hypothesis, or bounded cardinality of the true inlier set) nor supply a closed-form proof that the harmonic step preserves the same robustness guarantees. We will add a dedicated subsection with the full derivation, the explicit assumptions on DINO-induced graphs, a sketch showing equivalence under those conditions, and a short discussion of potential violations together with empirical checks confirming that HCM retains competitive robustness on the reported OOD benchmarks. This revision directly addresses the concern that the m-to-m + HCM pipeline might lose necessary properties. revision: yes

Circularity Check

0 steps flagged

No significant circularity: HCM introduced via independent likelihood reinterpretation

full rationale

The paper's derivation chain reinterprets an existing m-to-m robust mechanism as a zeroth-order likelihood approximation and proposes HCM as a faster successor. No equations, self-citations, or fitted parameters are exhibited that reduce the central claims (DINO + m-to-m + HCM for OOD pose estimation) to tautologies or inputs by construction. The likelihood view and HCM mechanism are presented as novel contributions with independent grounding, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 1 invented entities

Review conducted from abstract only; full derivations, parameter choices, and experimental protocols are unavailable. The ledger reflects only assumptions explicitly stated or implied in the abstract.

axioms (2)

domain assumption DINO features exhibit inherent ambiguity when matching points among semantically similar instances
Directly stated as the motivation prompting the m-to-m paradigm
ad hoc to paper The existing robust mechanism under m-to-m association is a zeroth-order approximation of an otherwise intractable likelihood
Invoked to justify the proposal of HCM as a faster alternative

invented entities (1)

Harmonic Consensus Maximization (HCM) no independent evidence
purpose: Faster and finer-grained robust mechanism for m-to-m data association in feature matching
Newly introduced in the paper; no independent evidence or external validation provided in the abstract

pith-pipeline@v0.9.0 · 5477 in / 1543 out tokens · 33063 ms · 2026-05-08T06:29:23.348197+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 2 internal anchors

[1]

Deep vit features as dense visual descriptors.arXiv preprint arXiv:2112.05814, 2(3):4, 2021

Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2(3):4, 2021

work page arXiv 2021
[2]

Outlier-robust estimation: Hardness, minimally tuned algorithms, and applications.IEEE Transactions on Robotics, 38(1):281–301, 2021

Pasquale Antonante, Vasileios Tzoumas, Heng Yang, and Luca Carlone. Outlier-robust estimation: Hardness, minimally tuned algorithms, and applications.IEEE Transactions on Robotics, 38(1):281–301, 2021

work page 2021
[3]

Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors

Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. InCVPR, 2017

work page 2017
[4]

Surf: Speeded up robust features

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. InEuropean conference on computer vision, pages 404–417. Springer, 2006

work page 2006
[5]

Mismatched: Evaluating the limits of image matching approaches and benchmarks

Sierra Bonilla, Chiara Di Vece, Rema Daher, Xinwei Ju, Danail Stoyanov, Francisco Vasconcelos, and Sophia Bano. Mismatched: Evaluating the limits of image matching approaches and benchmarks. InEuropean Con- ference on Computer Vision, pages 120–137. Springer, 2024

work page 2024
[6]

Globally-optimal inlier set maximisation for camera pose and correspondence estimation.IEEE transactions on pattern analysis and machine intelli- gence, 42(2):328–342, 2018

Dylan Campbell, Lars Petersson, Laurent Kneip, and Hongdong Li. Globally-optimal inlier set maximisation for camera pose and correspondence estimation.IEEE transactions on pattern analysis and machine intelli- gence, 42(2):328–342, 2018

work page 2018
[7]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6):1874–1890, 2021

Carlos Campos, Richard Elvira, Juan J G ´omez Rodr´ıguez, Jos ´e MM Montiel, and Juan D Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6):1874–1890, 2021

work page 2021
[8]

Hybrid scene compression for visual localization

Federico Camposeco, Andrea Cohen, Marc Pollefeys, and Torsten Sattler. Hybrid scene compression for visual localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7653–7662, 2019

work page 2019
[9]

Sinkhorn distances: Lightspeed computa- tion of optimal transport.Advances in neural information processing systems, 26, 2013

Marco Cuturi. Sinkhorn distances: Lightspeed computa- tion of optimal transport.Advances in neural information processing systems, 26, 2013

work page 2013
[10]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

work page 2017
[11]

Superpoint: Self-supervised interest point de- tection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point de- tection and description. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018

work page 2018
[12]

Roma: Robust dense feature matching

Johan Edstedt, Qiyu Sun, Georg B ¨okman, M ˚arten Wadenb¨ack, and Michael Felsberg. Roma: Robust dense feature matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19790–19800, 2024

work page 2024
[13]

A brute-force algorithm for reconstructing a scene from two projections

Olof Enqvist, Fangyuan Jiang, and Fredrik Kahl. A brute-force algorithm for reconstructing a scene from two projections. InCVPR 2011, pages 2961–2968. IEEE, 2011

work page 2011
[14]

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commu- nications of the ACM, 24(6):381–395, 1981

Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commu- nications of the ACM, 24(6):381–395, 1981

work page 1981
[15]

Optimal relative pose with unknown correspondences

Johan Fredriksson, Viktor Larsson, Carl Olsson, and Fredrik Kahl. Optimal relative pose with unknown correspondences. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1728– 1736, 2016

work page 2016
[16]

Rotation averaging.International journal of computer vision, 103(3):267–305, 2013

Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hong- dong Li. Rotation averaging.International journal of computer vision, 103(3):267–305, 2013

work page 2013
[17]

An nˆ5/2 algorithm for maximum matchings in bipartite graphs

John E Hopcroft and Richard M Karp. An nˆ5/2 algorithm for maximum matchings in bipartite graphs. SIAM Journal on computing, 2(4):225–231, 1973

work page 1973
[18]

Navi: Category-agnostic image collections with high-quality 3d shape and pose annotations.Ad- vances in Neural Information Processing Systems, 36: 76061–76084, 2023

Varun Jampani, Kevis-Kokitsi Maninis, Andreas Engel- hardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, Andr ´e Araujo, Ricardo Martin Brualla, Kaushal Patel, et al. Navi: Category-agnostic image collections with high-quality 3d shape and pose annotations.Ad- vances in Neural Information Processing Systems, 36: 76061–76084, 2023

work page 2023
[19]

Omniglue: Generalizable feature matching with foundation model guidance

Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, and Andr ´e Araujo. Omniglue: Generalizable feature matching with foundation model guidance. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19865–19875, 2024

work page 2024
[20]

Score: Saturated consensus relocalization in semantic line maps

Haodong Jiang, Xiang Zheng, Yanglin Zhang, Qingcheng Zeng, Yiqian Li, Ziyang Hong, and Junfeng Wu. Score: Saturated consensus relocalization in semantic line maps. arXiv preprint arXiv:2503.03254, 2025

work page arXiv 2025
[21]

Anyloc: Towards universal visual place recognition.IEEE Robotics and Automation Letters, 9(2):1286–1293, 2023

Nikhil Keetha, Avneesh Mishra, Jay Karhade, Kr- ishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. Anyloc: Towards universal visual place recognition.IEEE Robotics and Automation Letters, 9(2):1286–1293, 2023

work page 2023
[22]

Approximation algorithms for rectangle stabbing and interval stabbing problems.SIAM Journal on Discrete Mathematics, 20 (3):748–768, 2006

Sofia Kovaleva and Frits CR Spieksma. Approximation algorithms for rectangle stabbing and interval stabbing problems.SIAM Journal on Discrete Mathematics, 20 (3):748–768, 2006

work page 2006
[23]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J ´erˆome Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024

work page 2024
[24]

Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018

work page 2041
[25]

Lightglue: Local feature matching at light speed

Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023

work page 2023
[26]

Mind the gap: Aligning vision foundation models to image feature matching

Yuhan Liu, Jingwen Fu, Yang Wu, Kangyi Wu, Pengna Li, Jiayi Wu, Sanping Zhou, and Jingmin Xin. Mind the gap: Aligning vision foundation models to image feature matching. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20313–20323, 2025

work page 2025
[27]

Distinctive image features from scale- invariant keypoints.International journal of computer vision, 60:91–110, 2004

David G Lowe. Distinctive image features from scale- invariant keypoints.International journal of computer vision, 60:91–110, 2004

work page 2004
[28]

Deter- ministic sample consensus with multiple match hypothe- ses

Paul McIlroy, Simon Taylor, and Tom Drummond. Deter- ministic sample consensus with multiple match hypothe- ses

work page
[29]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fer- nandez, Daniel Haziza, Francisco Massa, Alaaeldin El- Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review arXiv 2023
[30]

Fast registration based on noisy planes with unknown correspondences for 3-d mapping

Kaustubh Pathak, Andreas Birk, Narunas Va ˇskeviˇcius, and Jann Poppinga. Fast registration based on noisy planes with unknown correspondences for 3-d mapping. IEEE Transactions on Robotics, 26(3):424–441, 2010

work page 2010
[31]

Learning transferable visual models from natural lan- guage supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InInternational conference on ma- chine learning, pages 8748–8763. PmLR, 2021

work page 2021
[32]

Neighbourhood consensus networks.Advances in neural information processing systems, 31, 2018

Ignacio Rocco, Mircea Cimpoi, Relja Arandjelovi ´c, Aki- hiko Torii, Tomas Pajdla, and Josef Sivic. Neighbourhood consensus networks.Advances in neural information processing systems, 31, 2018

work page 2018
[33]

From coarse to fine: Robust hierarchical localization at large scale

Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12716–12725, 2019

work page 2019
[34]

Superglue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Mal- isiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020

work page 2020
[35]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review arXiv 2025
[36]

Diagonal equivalence to matrices with prescribed row and column sums.The American Mathematical Monthly, 74(4):402–405, 1967

Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums.The American Mathematical Monthly, 74(4):402–405, 1967

work page 1967
[37]

Concerning nonneg- ative matrices and doubly stochastic matrices.Pacific Journal of Mathematics, 21(2):343–348, 1967

Richard Sinkhorn and Paul Knopp. Concerning nonneg- ative matrices and doubly stochastic matrices.Pacific Journal of Mathematics, 21(2):343–348, 1967

work page 1967
[38]

Loftr: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8922–8931, 2021

work page 2021
[39]

Emergent correspondence from image diffusion.Advances in Neural Information Processing Systems, 36:1363–1389, 2023

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion.Advances in Neural Information Processing Systems, 36:1363–1389, 2023

work page 2023
[40]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017
[41]

Efficient loftr: Semi-dense local feature matching with sparse-like speed

Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient loftr: Semi-dense local feature matching with sparse-like speed. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21666–21675, 2024

work page 2024
[42]

Certifiably optimal mutual lo- calization with anonymous bearing measurements.IEEE Robotics and Automation Letters, 7(4):9374–9381, 2022

Yingjian Wang, Xiangyong Wen, Longji Yin, Chao Xu, Yanjun Cao, and Fei Gao. Certifiably optimal mutual lo- calization with anonymous bearing measurements.IEEE Robotics and Automation Letters, 7(4):9374–9381, 2022

work page 2022
[43]

Teaser: Fast and certifiable point cloud registration.IEEE Transac- tions on Robotics, 37(2):314–333, 2020

Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point cloud registration.IEEE Transac- tions on Robotics, 37(2):314–333, 2020

work page 2020
[44]

Optimal essential matrix estimation via inlier-set maximization

Jiaolong Yang, Hongdong Li, and Yunde Jia. Optimal essential matrix estimation via inlier-set maximization. InEuropean Conference on Computer Vision, pages 111–

work page
[45]

Lift: Learned invariant feature transform

Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. InEuropean conference on computer vision, pages 467–

work page
[46]

A tale of two features: Stable diffu- sion complements dino for zero-shot semantic corre- spondence.Advances in Neural Information Processing Systems, 36:45533–45547, 2023

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming- Hsuan Yang. A tale of two features: Stable diffu- sion complements dino for zero-shot semantic corre- spondence.Advances in Neural Information Processing Systems, 36:45533–45547, 2023

work page 2023
[47]

Telling left from right: Identifying geometry-aware se- mantic correspondence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. Telling left from right: Identifying geometry-aware se- mantic correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3076–3085, 2024. APPENDIXA HYPER-PARAMETERSENSITIVITYANALYSIS Recall thatHarmonic Con...

work page 2024

[1] [1]

Deep vit features as dense visual descriptors.arXiv preprint arXiv:2112.05814, 2(3):4, 2021

Shir Amir, Yossi Gandelsman, Shai Bagon, and Tali Dekel. Deep vit features as dense visual descriptors. arXiv preprint arXiv:2112.05814, 2(3):4, 2021

work page arXiv 2021

[2] [2]

Outlier-robust estimation: Hardness, minimally tuned algorithms, and applications.IEEE Transactions on Robotics, 38(1):281–301, 2021

Pasquale Antonante, Vasileios Tzoumas, Heng Yang, and Luca Carlone. Outlier-robust estimation: Hardness, minimally tuned algorithms, and applications.IEEE Transactions on Robotics, 38(1):281–301, 2021

work page 2021

[3] [3]

Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors

Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. InCVPR, 2017

work page 2017

[4] [4]

Surf: Speeded up robust features

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. InEuropean conference on computer vision, pages 404–417. Springer, 2006

work page 2006

[5] [5]

Mismatched: Evaluating the limits of image matching approaches and benchmarks

Sierra Bonilla, Chiara Di Vece, Rema Daher, Xinwei Ju, Danail Stoyanov, Francisco Vasconcelos, and Sophia Bano. Mismatched: Evaluating the limits of image matching approaches and benchmarks. InEuropean Con- ference on Computer Vision, pages 120–137. Springer, 2024

work page 2024

[6] [6]

Globally-optimal inlier set maximisation for camera pose and correspondence estimation.IEEE transactions on pattern analysis and machine intelli- gence, 42(2):328–342, 2018

Dylan Campbell, Lars Petersson, Laurent Kneip, and Hongdong Li. Globally-optimal inlier set maximisation for camera pose and correspondence estimation.IEEE transactions on pattern analysis and machine intelli- gence, 42(2):328–342, 2018

work page 2018

[7] [7]

Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6):1874–1890, 2021

Carlos Campos, Richard Elvira, Juan J G ´omez Rodr´ıguez, Jos ´e MM Montiel, and Juan D Tard ´os. Orb-slam3: An accurate open-source library for visual, visual–inertial, and multimap slam.IEEE transactions on robotics, 37(6):1874–1890, 2021

work page 2021

[8] [8]

Hybrid scene compression for visual localization

Federico Camposeco, Andrea Cohen, Marc Pollefeys, and Torsten Sattler. Hybrid scene compression for visual localization. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7653–7662, 2019

work page 2019

[9] [9]

Sinkhorn distances: Lightspeed computa- tion of optimal transport.Advances in neural information processing systems, 26, 2013

Marco Cuturi. Sinkhorn distances: Lightspeed computa- tion of optimal transport.Advances in neural information processing systems, 26, 2013

work page 2013

[10] [10]

Scannet: Richly-annotated 3d reconstructions of indoor scenes

Angela Dai, Angel X Chang, Manolis Savva, Maciej Hal- ber, Thomas Funkhouser, and Matthias Nießner. Scannet: Richly-annotated 3d reconstructions of indoor scenes. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 5828–5839, 2017

work page 2017

[11] [11]

Superpoint: Self-supervised interest point de- tection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point de- tection and description. InProceedings of the IEEE conference on computer vision and pattern recognition workshops, pages 224–236, 2018

work page 2018

[12] [12]

Roma: Robust dense feature matching

Johan Edstedt, Qiyu Sun, Georg B ¨okman, M ˚arten Wadenb¨ack, and Michael Felsberg. Roma: Robust dense feature matching. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19790–19800, 2024

work page 2024

[13] [13]

A brute-force algorithm for reconstructing a scene from two projections

Olof Enqvist, Fangyuan Jiang, and Fredrik Kahl. A brute-force algorithm for reconstructing a scene from two projections. InCVPR 2011, pages 2961–2968. IEEE, 2011

work page 2011

[14] [14]

Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commu- nications of the ACM, 24(6):381–395, 1981

Martin A Fischler and Robert C Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography.Commu- nications of the ACM, 24(6):381–395, 1981

work page 1981

[15] [15]

Optimal relative pose with unknown correspondences

Johan Fredriksson, Viktor Larsson, Carl Olsson, and Fredrik Kahl. Optimal relative pose with unknown correspondences. InProceedings of the IEEE conference on computer vision and pattern recognition, pages 1728– 1736, 2016

work page 2016

[16] [16]

Rotation averaging.International journal of computer vision, 103(3):267–305, 2013

Richard Hartley, Jochen Trumpf, Yuchao Dai, and Hong- dong Li. Rotation averaging.International journal of computer vision, 103(3):267–305, 2013

work page 2013

[17] [17]

An nˆ5/2 algorithm for maximum matchings in bipartite graphs

John E Hopcroft and Richard M Karp. An nˆ5/2 algorithm for maximum matchings in bipartite graphs. SIAM Journal on computing, 2(4):225–231, 1973

work page 1973

[18] [18]

Navi: Category-agnostic image collections with high-quality 3d shape and pose annotations.Ad- vances in Neural Information Processing Systems, 36: 76061–76084, 2023

Varun Jampani, Kevis-Kokitsi Maninis, Andreas Engel- hardt, Arjun Karpur, Karen Truong, Kyle Sargent, Stefan Popov, Andr ´e Araujo, Ricardo Martin Brualla, Kaushal Patel, et al. Navi: Category-agnostic image collections with high-quality 3d shape and pose annotations.Ad- vances in Neural Information Processing Systems, 36: 76061–76084, 2023

work page 2023

[19] [19]

Omniglue: Generalizable feature matching with foundation model guidance

Hanwen Jiang, Arjun Karpur, Bingyi Cao, Qixing Huang, and Andr ´e Araujo. Omniglue: Generalizable feature matching with foundation model guidance. InProceed- ings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19865–19875, 2024

work page 2024

[20] [20]

Score: Saturated consensus relocalization in semantic line maps

Haodong Jiang, Xiang Zheng, Yanglin Zhang, Qingcheng Zeng, Yiqian Li, Ziyang Hong, and Junfeng Wu. Score: Saturated consensus relocalization in semantic line maps. arXiv preprint arXiv:2503.03254, 2025

work page arXiv 2025

[21] [21]

Anyloc: Towards universal visual place recognition.IEEE Robotics and Automation Letters, 9(2):1286–1293, 2023

Nikhil Keetha, Avneesh Mishra, Jay Karhade, Kr- ishna Murthy Jatavallabhula, Sebastian Scherer, Madhava Krishna, and Sourav Garg. Anyloc: Towards universal visual place recognition.IEEE Robotics and Automation Letters, 9(2):1286–1293, 2023

work page 2023

[22] [22]

Approximation algorithms for rectangle stabbing and interval stabbing problems.SIAM Journal on Discrete Mathematics, 20 (3):748–768, 2006

Sofia Kovaleva and Frits CR Spieksma. Approximation algorithms for rectangle stabbing and interval stabbing problems.SIAM Journal on Discrete Mathematics, 20 (3):748–768, 2006

work page 2006

[23] [23]

Grounding image matching in 3d with mast3r

Vincent Leroy, Yohann Cabon, and J ´erˆome Revaud. Grounding image matching in 3d with mast3r. In European conference on computer vision, pages 71–91. Springer, 2024

work page 2024

[24] [24]

Megadepth: Learning single-view depth prediction from internet photos

Zhengqi Li and Noah Snavely. Megadepth: Learning single-view depth prediction from internet photos. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 2041–2050, 2018

work page 2041

[25] [25]

Lightglue: Local feature matching at light speed

Philipp Lindenberger, Paul-Edouard Sarlin, and Marc Pollefeys. Lightglue: Local feature matching at light speed. InProceedings of the IEEE/CVF international conference on computer vision, pages 17627–17638, 2023

work page 2023

[26] [26]

Mind the gap: Aligning vision foundation models to image feature matching

Yuhan Liu, Jingwen Fu, Yang Wu, Kangyi Wu, Pengna Li, Jiayi Wu, Sanping Zhou, and Jingmin Xin. Mind the gap: Aligning vision foundation models to image feature matching. InProceedings of the IEEE/CVF International Conference on Computer Vision, pages 20313–20323, 2025

work page 2025

[27] [27]

Distinctive image features from scale- invariant keypoints.International journal of computer vision, 60:91–110, 2004

David G Lowe. Distinctive image features from scale- invariant keypoints.International journal of computer vision, 60:91–110, 2004

work page 2004

[28] [28]

Deter- ministic sample consensus with multiple match hypothe- ses

Paul McIlroy, Simon Taylor, and Tom Drummond. Deter- ministic sample consensus with multiple match hypothe- ses

work page

[29] [29]

DINOv2: Learning Robust Visual Features without Supervision

Maxime Oquab, Timoth ´ee Darcet, Th ´eo Moutakanni, Huy V o, Marc Szafraniec, Vasil Khalidov, Pierre Fer- nandez, Daniel Haziza, Francisco Massa, Alaaeldin El- Nouby, et al. Dinov2: Learning robust visual features without supervision.arXiv preprint arXiv:2304.07193, 2023

work page internal anchor Pith review arXiv 2023

[30] [30]

Fast registration based on noisy planes with unknown correspondences for 3-d mapping

Kaustubh Pathak, Andreas Birk, Narunas Va ˇskeviˇcius, and Jann Poppinga. Fast registration based on noisy planes with unknown correspondences for 3-d mapping. IEEE Transactions on Robotics, 26(3):424–441, 2010

work page 2010

[31] [31]

Learning transferable visual models from natural lan- guage supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural lan- guage supervision. InInternational conference on ma- chine learning, pages 8748–8763. PmLR, 2021

work page 2021

[32] [32]

Neighbourhood consensus networks.Advances in neural information processing systems, 31, 2018

Ignacio Rocco, Mircea Cimpoi, Relja Arandjelovi ´c, Aki- hiko Torii, Tomas Pajdla, and Josef Sivic. Neighbourhood consensus networks.Advances in neural information processing systems, 31, 2018

work page 2018

[33] [33]

From coarse to fine: Robust hierarchical localization at large scale

Paul-Edouard Sarlin, Cesar Cadena, Roland Siegwart, and Marcin Dymczyk. From coarse to fine: Robust hierarchical localization at large scale. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 12716–12725, 2019

work page 2019

[34] [34]

Superglue: Learning feature matching with graph neural networks

Paul-Edouard Sarlin, Daniel DeTone, Tomasz Mal- isiewicz, and Andrew Rabinovich. Superglue: Learning feature matching with graph neural networks. InPro- ceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 4938–4947, 2020

work page 2020

[35] [35]

DINOv3

Oriane Sim ´eoni, Huy V V o, Maximilian Seitzer, Federico Baldassarre, Maxime Oquab, Cijo Jose, Vasil Khalidov, Marc Szafraniec, Seungeun Yi, Micha ¨el Ramamonjisoa, et al. Dinov3.arXiv preprint arXiv:2508.10104, 2025

work page internal anchor Pith review arXiv 2025

[36] [36]

Diagonal equivalence to matrices with prescribed row and column sums.The American Mathematical Monthly, 74(4):402–405, 1967

Richard Sinkhorn. Diagonal equivalence to matrices with prescribed row and column sums.The American Mathematical Monthly, 74(4):402–405, 1967

work page 1967

[37] [37]

Concerning nonneg- ative matrices and doubly stochastic matrices.Pacific Journal of Mathematics, 21(2):343–348, 1967

Richard Sinkhorn and Paul Knopp. Concerning nonneg- ative matrices and doubly stochastic matrices.Pacific Journal of Mathematics, 21(2):343–348, 1967

work page 1967

[38] [38]

Loftr: Detector-free local feature matching with transformers

Jiaming Sun, Zehong Shen, Yuang Wang, Hujun Bao, and Xiaowei Zhou. Loftr: Detector-free local feature matching with transformers. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 8922–8931, 2021

work page 2021

[39] [39]

Emergent correspondence from image diffusion.Advances in Neural Information Processing Systems, 36:1363–1389, 2023

Luming Tang, Menglin Jia, Qianqian Wang, Cheng Perng Phoo, and Bharath Hariharan. Emergent correspondence from image diffusion.Advances in Neural Information Processing Systems, 36:1363–1389, 2023

work page 2023

[40] [40]

Attention is all you need.Advances in neural information processing systems, 30, 2017

Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need.Advances in neural information processing systems, 30, 2017

work page 2017

[41] [41]

Efficient loftr: Semi-dense local feature matching with sparse-like speed

Yifan Wang, Xingyi He, Sida Peng, Dongli Tan, and Xiaowei Zhou. Efficient loftr: Semi-dense local feature matching with sparse-like speed. InProceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 21666–21675, 2024

work page 2024

[42] [42]

Certifiably optimal mutual lo- calization with anonymous bearing measurements.IEEE Robotics and Automation Letters, 7(4):9374–9381, 2022

Yingjian Wang, Xiangyong Wen, Longji Yin, Chao Xu, Yanjun Cao, and Fei Gao. Certifiably optimal mutual lo- calization with anonymous bearing measurements.IEEE Robotics and Automation Letters, 7(4):9374–9381, 2022

work page 2022

[43] [43]

Teaser: Fast and certifiable point cloud registration.IEEE Transac- tions on Robotics, 37(2):314–333, 2020

Heng Yang, Jingnan Shi, and Luca Carlone. Teaser: Fast and certifiable point cloud registration.IEEE Transac- tions on Robotics, 37(2):314–333, 2020

work page 2020

[44] [44]

Optimal essential matrix estimation via inlier-set maximization

Jiaolong Yang, Hongdong Li, and Yunde Jia. Optimal essential matrix estimation via inlier-set maximization. InEuropean Conference on Computer Vision, pages 111–

work page

[45] [45]

Lift: Learned invariant feature transform

Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. InEuropean conference on computer vision, pages 467–

work page

[46] [46]

A tale of two features: Stable diffu- sion complements dino for zero-shot semantic corre- spondence.Advances in Neural Information Processing Systems, 36:45533–45547, 2023

Junyi Zhang, Charles Herrmann, Junhwa Hur, Luisa Polania Cabrera, Varun Jampani, Deqing Sun, and Ming- Hsuan Yang. A tale of two features: Stable diffu- sion complements dino for zero-shot semantic corre- spondence.Advances in Neural Information Processing Systems, 36:45533–45547, 2023

work page 2023

[47] [47]

Telling left from right: Identifying geometry-aware se- mantic correspondence

Junyi Zhang, Charles Herrmann, Junhwa Hur, Eric Chen, Varun Jampani, Deqing Sun, and Ming-Hsuan Yang. Telling left from right: Identifying geometry-aware se- mantic correspondence. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 3076–3085, 2024. APPENDIXA HYPER-PARAMETERSENSITIVITYANALYSIS Recall thatHarmonic Con...

work page 2024