Deploy DINO with Many-to-Many Association
Pith reviewed 2026-05-08 06:29 UTC · model grok-4.3
The pith
General-purpose DINO features, paired with many-to-many association and Harmonic Consensus Maximization (HCM), compete with specialized matching models on out-of-distribution datasets.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Adopting many-to-many association for DINO features to manage their inherent ambiguity among semantically similar instances, and introducing Harmonic Consensus Maximization as an efficient, likelihood-based robust mechanism, allow these general-purpose features to achieve performance comparable to specialized matching models on out-of-distribution datasets in downstream tasks such as camera pose estimation.
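To make the many-to-many idea concrete, here is a minimal sketch of how candidate associations could be built from dense descriptors. It assumes descriptors are plain lists of floats and that each point keeps its top-k most similar candidates by cosine similarity. The function names, the top-k rule, and the toy data are illustrative assumptions, not the paper's stated procedure.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0

def many_to_many_candidates(desc_a, desc_b, k=2):
    """For each descriptor in image A, keep its k most similar descriptors in B.

    Returns (i, j, similarity) triples. Each point of A keeps k candidates and
    a point of B may be selected by several points of A, so the association
    is many-to-many rather than one-to-one.
    """
    candidates = []
    for i, u in enumerate(desc_a):
        scored = sorted(((cosine(u, v), j) for j, v in enumerate(desc_b)), reverse=True)
        for sim, j in scored[:k]:
            candidates.append((i, j, sim))
    return candidates

# Hypothetical toy data: B contains two near-identical descriptors (0 and 1),
# mimicking semantically similar instances that a generalist feature confuses.
desc_a = [[1.0, 0.0], [0.0, 1.0]]
desc_b = [[0.9, 0.1], [1.0, 0.05], [0.0, 1.0]]
cands = many_to_many_candidates(desc_a, desc_b, k=2)
pairs = [(i, j) for i, j, _ in cands]
print(pairs)
```

In this toy case point 0 of B is kept as a candidate for both points of A, which is exactly the ambiguity a downstream robust mechanism then has to resolve.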
What carries the argument
Harmonic Consensus Maximization (HCM), which provides faster and finer-grained robust estimation by treating the problem from a likelihood perspective instead of computing a maximum-cardinality matching for each parameter hypothesis.
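For context, the baseline that HCM is said to replace can be sketched as follows. Under many-to-many association, simply counting inlier pairs overcounts, because one point may be consistent with several candidates; the existing mechanism instead scores a hypothesis by the size of a maximum-cardinality matching in the inlier graph. The sketch below uses Kuhn's simple augmenting-path algorithm rather than Hopcroft-Karp, and the residual table and threshold are hypothetical values chosen for illustration. It does not implement HCM itself, whose definition is not given in this summary.

```python
def inlier_edges(candidates, residual, threshold):
    """Keep candidate pairs (i, j) whose residual under the hypothesis is small."""
    return [(i, j) for i, j in candidates if residual[(i, j)] <= threshold]

def max_cardinality_matching(edges):
    """Size of a maximum bipartite matching, via augmenting paths (Kuhn's algorithm)."""
    adj = {}
    for i, j in edges:
        adj.setdefault(i, []).append(j)
    match_of_j = {}  # right-side point -> left-side point currently matched to it

    def try_augment(i, visited):
        for j in adj.get(i, []):
            if j in visited:
                continue
            visited.add(j)
            # Take j if it is free, or if its current partner can be re-routed.
            if j not in match_of_j or try_augment(match_of_j[j], visited):
                match_of_j[j] = i
                return True
        return False

    return sum(1 for i in adj if try_augment(i, set()))

# Hypothetical residuals of four candidate pairs under one pose hypothesis.
residual = {(0, 0): 0.1, (0, 1): 0.2, (1, 0): 0.3, (2, 0): 5.0}
edges = inlier_edges(list(residual), residual, threshold=1.0)
naive_count = len(edges)                          # counts point 0 of A twice
matched_count = max_cardinality_matching(edges)   # each point used at most once
print(naive_count, matched_count)
```

Because this matching has to be recomputed for every parameter hypothesis, its cost multiplies with the number of hypotheses evaluated, which is the inefficiency the abstract attributes to the existing mechanism.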
Load-bearing premise
The assumptions that DINO features require a many-to-many paradigm because of ambiguity among semantically similar instances, and that HCM's likelihood approximation delivers equivalent robustness for tasks like camera pose estimation.
What would settle it
Running camera pose estimation experiments on multiple out-of-distribution datasets and comparing the accuracy and runtime of DINO with m-to-m association plus HCM against specialized matching models; equal or better accuracy with better efficiency would confirm the claim.
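Such a comparison needs a pose accuracy measure. One commonly used choice is the angular error between the estimated and ground-truth rotations; the sketch below computes it for 3x3 matrices given as nested lists. This summary does not say which metric the paper reports, so the function name and the choice of metric are assumptions for illustration.

```python
import math

def rotation_error_deg(R_est, R_gt):
    """Angle in degrees of the relative rotation R_est^T R_gt (3x3 nested lists)."""
    # trace(R_est^T R_gt) equals the sum of elementwise products of the two matrices.
    trace = sum(R_est[r][c] * R_gt[r][c] for r in range(3) for c in range(3))
    # Clamp to [-1, 1] to guard against rounding before taking the arccosine.
    cos_angle = max(-1.0, min(1.0, (trace - 1.0) / 2.0))
    return math.degrees(math.acos(cos_angle))

identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
rot_z_90 = [[0.0, -1.0, 0.0], [1.0, 0.0, 0.0], [0.0, 0.0, 1.0]]
err_same = rotation_error_deg(identity, identity)   # identical rotations
err_90 = rotation_error_deg(identity, rot_z_90)     # quarter turn about z
print(err_same, err_90)
```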
Original abstract
Motivated by the limited generalization of supervised image matching models to unseen image domains, we explore the zero-shot deployment of DINO features for this task. The generalist visual representation extracted from DINO has inherent ambiguity when used to match feature points among semantically similar instances, prompting us to adopt a many-to-many (m-to-m) matching paradigm. However, the existing robust mechanism under m-to-m data association is computationally heavy, as it requires finding a maximum-cardinality matching in the inlier association graph for each parameter evaluation. To address this inefficiency, we introduce a novel likelihood perspective, which interprets the existing method as a zeroth-order approximation of an otherwise intractable likelihood calculation, and inspires us to propose a faster and finer-grained robust mechanism, termed Harmonic Consensus Maximization (HCM). Taking camera pose estimation as an exemplary downstream task, we demonstrate that general-purpose visual features, used out of the box without any adaptation, can compete with specialized matching models on out-of-distribution datasets when paired with m-to-m association and the HCM mechanism.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript explores the zero-shot deployment of unmodified DINO features for image matching tasks by adopting a many-to-many (m-to-m) association paradigm to address inherent ambiguities among semantically similar instances. It reinterprets the existing robust mechanism under m-to-m matching as a zeroth-order approximation to an otherwise intractable likelihood calculation and introduces Harmonic Consensus Maximization (HCM) as a faster, finer-grained robust alternative. Using camera pose estimation as the downstream task, the paper claims that general-purpose DINO features combined with m-to-m association and HCM can compete with specialized matching models on out-of-distribution datasets.
Significance. If the result holds and HCM is shown to preserve the robustness properties of maximum-cardinality matching, this would be significant for computer vision as it demonstrates that off-the-shelf generalist visual representations can handle challenging OOD geometric tasks without any adaptation or retraining. The likelihood-based reinterpretation provides a principled foundation for deriving new mechanisms, and the work highlights the potential of m-to-m paradigms for ambiguous feature matching scenarios.
major comments (1)
- [Abstract and HCM derivation] Abstract and HCM proposal: The central claim that HCM is a valid faster approximation to the intractable likelihood (reinterpreting the existing max-cardinality method as zeroth-order) is load-bearing for the competitiveness result on OOD pose estimation. However, no explicit derivation, assumptions (e.g., independence or cardinality conditions on DINO descriptor graphs), or closed-form equivalence is provided to show that the harmonic consensus step rigorously approximates or preserves the robustness of maximum-cardinality matching. If the inlier graph induced by DINO features violates these implicit assumptions, HCM may not retain the necessary properties, undermining the switch from one-to-one matching.
minor comments (2)
- [Abstract] The abstract states the demonstration on camera pose estimation but provides no equations, quantitative results, error bars, or dataset details, making immediate assessment of the 'compete with specialized models' claim difficult.
- [Method introduction] The introduction of 'Harmonic Consensus Maximization (HCM)' would benefit from an early, self-contained definition of the 'harmonic' component and how it differs operationally from standard consensus maximization before the likelihood reinterpretation is presented.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and for identifying the need for greater rigor in the theoretical justification of HCM. We address the major comment below and commit to a revision that strengthens the manuscript without altering its core claims or experimental results.
-
Referee: [Abstract and HCM derivation] Abstract and HCM proposal: The central claim that HCM is a valid faster approximation to the intractable likelihood (reinterpreting the existing max-cardinality method as zeroth-order) is load-bearing for the competitiveness result on OOD pose estimation. However, no explicit derivation, assumptions (e.g., independence or cardinality conditions on DINO descriptor graphs), or closed-form equivalence is provided to show that the harmonic consensus step rigorously approximates or preserves the robustness of maximum-cardinality matching. If the inlier graph induced by DINO features violates these implicit assumptions, HCM may not retain the necessary properties, undermining the switch from one-to-one matching.
Authors: We agree that the current presentation would benefit from an explicit derivation and statement of assumptions. Section 3 of the manuscript introduces the likelihood view by modeling associations as an inlier graph and treats maximum-cardinality matching as selecting the largest consistent set (a zeroth-order count-based approximation to the mode of the joint likelihood). HCM is motivated as replacing the cardinality objective with a harmonic-mean consensus score over pairwise consistencies, which is computationally lighter and incorporates descriptor similarity magnitudes. However, we acknowledge that the text does not list the required assumptions (e.g., conditional independence of edge weights given the pose hypothesis, or bounded cardinality of the true inlier set) nor supply a closed-form proof that the harmonic step preserves the same robustness guarantees. We will add a dedicated subsection with the full derivation, the explicit assumptions on DINO-induced graphs, a sketch showing equivalence under those conditions, and a short discussion of potential violations together with empirical checks confirming that HCM retains competitive robustness on the reported OOD benchmarks. This revision directly addresses the concern that the m-to-m + HCM pipeline might lose necessary properties.
revision: yes
Circularity Check
No significant circularity: HCM introduced via independent likelihood reinterpretation
The paper's derivation chain reinterprets an existing m-to-m robust mechanism as a zeroth-order likelihood approximation and proposes HCM as a faster successor. No equations, self-citations, or fitted parameters are exhibited that reduce the central claims (DINO + m-to-m + HCM for OOD pose estimation) to tautologies or inputs by construction. The likelihood view and HCM mechanism are presented as novel contributions with independent grounding, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: DINO features exhibit inherent ambiguity when matching points among semantically similar instances
- ad hoc to paper: The existing robust mechanism under m-to-m association is a zeroth-order approximation of an otherwise intractable likelihood
invented entities (1)
- Harmonic Consensus Maximization (HCM): no independent evidence