UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor
Pith reviewed 2026-05-25 00:49 UTC · model grok-4.3
The pith
UnsuperPoint trains an interest point detector and descriptor unsupervised via a siamese network and novel loss functions without pseudo-labels.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
UnsuperPoint is an unsupervised deep learning interest point detector and descriptor learned through self-supervised siamese training with a novel loss function that automatically learns point scores and positions via regression, incorporates non-maximum suppression inside the model, and regularizes predictions to be uniformly distributed, all without generating pseudo ground truth points, using structure-from-motion representations, or performing multiple training rounds.
What carries the argument
Self-supervised siamese network with regression of point positions to enable end-to-end training and non-maximum suppression, plus a uniformity loss that regularizes network predictions.
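The uniformity regularizer named above can be illustrated with a minimal sketch. This page does not state the formula, so the version below is an assumption: it sorts the predicted relative coordinates (taken to lie in [0, 1]) and penalizes their squared distance to evenly spaced targets. The function name `uniformity_loss` is hypothetical.

```python
def uniformity_loss(values):
    """Squared distance between sorted values and evenly spaced targets in [0, 1].

    Assumed formulation for illustration only; `values` are relative
    coordinates (for example all predicted x offsets in one image).
    """
    count = len(values)
    if count < 2:
        # A single value has no spread to compare against.
        return 0.0
    ordered = sorted(values)
    # Target for the i-th smallest value is i / (count - 1).
    return sum((v - i / (count - 1)) ** 2 for i, v in enumerate(ordered))


print(uniformity_loss([0.0, 0.5, 1.0]))  # evenly spread values
print(uniformity_loss([0.5, 0.5, 0.5]))  # values collapsed to one location
```

Evenly spread values incur no penalty, while values that all collapse to the same location are penalized, which is the degenerate case such a regularizer is meant to discourage.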
If this is right
- The detector and descriptor become end-to-end trainable without separate stages for pseudo-label creation.
- Non-maximum suppression is handled inside the learned model rather than as a post-process.
- The model achieves 323 fps at 224×320 resolution and 90 fps at 480×640 while remaining competitive on HPatches metrics.
- Only one round of training is needed, avoiding iterative pseudo-ground-truth pipelines.
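A minimal sketch of how per-cell position regression can stand in for post-hoc non-maximum suppression: if each grid cell emits exactly one point, detections are limited to one per cell by construction. The 8-pixel cell size, the [0, 1] offset range, and both function names are illustrative assumptions, not details taken from this page.

```python
def cell_predictions_to_points(rel_xy, scores, cell_size=8):
    """Map one (rel_x, rel_y) offset per grid cell to pixel coordinates.

    rel_xy[row][col] is an offset in [0, 1] inside that cell, and
    scores[row][col] is the matching point score (assumed layout).
    """
    points = []
    for row, (rel_row, score_row) in enumerate(zip(rel_xy, scores)):
        for col, ((rel_x, rel_y), score) in enumerate(zip(rel_row, score_row)):
            x = (col + rel_x) * cell_size
            y = (row + rel_y) * cell_size
            points.append((x, y, score))
    return points


def top_k_points(points, k):
    """Keep the k highest-scoring points; no neighborhood suppression is run."""
    return sorted(points, key=lambda p: p[2], reverse=True)[:k]


rel = [[(0.5, 0.5), (0.0, 1.0)]]   # a 1x2 grid of cells
scores = [[0.9, 0.2]]
print(top_k_points(cell_predictions_to_points(rel, scores), 1))
```

The design choice this illustrates is that selection reduces to a sort on scores, because the grid already limits how densely points can be placed.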
Where Pith is reading between the lines
- The uniformity loss could be adapted to other unsupervised feature tasks to prevent degenerate clustering of predictions.
- Single-pass training lowers the barrier to experimenting with new detector architectures compared with multi-stage label-refinement methods.
- Real-time operation supports direct deployment inside larger vision pipelines such as visual odometry on embedded hardware.
Load-bearing premise
The siamese self-supervision together with the new losses on scores, positions, and uniformity will produce repeatable and useful interest points without any external labels or pseudo-data.
What would settle it
Train the network once on the described losses and evaluate repeatability and matching score on HPatches image pairs; if performance remains substantially below supervised baselines and does not improve with the uniformity term, the unsupervised claim would not hold.
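The repeatability check described above can be sketched as follows. Definitions of repeatability differ between papers, so this symmetric version with a known homography and a 3-pixel threshold is an assumption, and the function names are hypothetical. For brevity it does not filter points that fall outside the shared field of view.

```python
import math


def warp_point(h, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    w = h[2][0] * x + h[2][1] * y + h[2][2]
    return ((h[0][0] * x + h[0][1] * y + h[0][2]) / w,
            (h[1][0] * x + h[1][1] * y + h[1][2]) / w)


def repeatability(points_a, points_b, h_ab, eps=3.0):
    """Fraction of points with a counterpart within eps pixels in the other image."""
    if not points_a or not points_b:
        return 0.0
    warped_a = [warp_point(h_ab, p) for p in points_a]

    def has_neighbor(p, candidates):
        return any(math.dist(p, q) <= eps for q in candidates)

    hits_a = sum(has_neighbor(p, points_b) for p in warped_a)
    hits_b = sum(has_neighbor(q, warped_a) for q in points_b)
    return (hits_a + hits_b) / (len(points_a) + len(points_b))


shift_right = [[1.0, 0.0, 10.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
print(repeatability([(0, 0), (5, 5), (100, 100)], [(10, 0), (16, 5)], shift_right))
```

Running the same function with and without the uniformity term in training would give the comparison the paragraph above asks for.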
Original abstract
It is hard to create consistent ground truth data for interest points in natural images, since interest points are hard to define clearly and consistently for a human annotator. This makes interest point detectors non-trivial to build. In this work, we introduce an unsupervised deep learning-based interest point detector and descriptor. Using a self-supervised approach, we utilize a siamese network and a novel loss function that enables interest point scores and positions to be learned automatically. The resulting interest point detector and descriptor is UnsuperPoint. We use regression of point positions to 1) make UnsuperPoint end-to-end trainable and 2) to incorporate non-maximum suppression in the model. Unlike most trainable detectors, it requires no generation of pseudo ground truth points, no structure-from-motion-generated representations and the model is learned from only one round of training. Furthermore, we introduce a novel loss function to regularize network predictions to be uniformly distributed. UnsuperPoint runs in real-time with 323 frames per second (fps) at a resolution of $224\times320$ and 90 fps at $480\times640$. It is comparable or better than state-of-the-art performance when measured for speed, repeatability, localization, matching score and homography estimation on the HPatch dataset.
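The abstract's point that regressing positions keeps the model end-to-end trainable can be illustrated with a toy example. The sketch below is a one-dimensional simplification under stated assumptions: a plain translation replaces the homography, the cell size is 8 pixels, and all function names are hypothetical. It shows that a distance loss between corresponding points has a well-defined gradient with respect to a regressed offset, which would not be the case for a hard argmax over pixel locations.

```python
def point_distance_loss(rel_a, rel_b, cell_a, cell_b, shift, cell_size=8):
    """Squared 1-D distance between point A moved by `shift` and point B."""
    pos_a = (cell_a + rel_a) * cell_size + shift
    pos_b = (cell_b + rel_b) * cell_size
    return (pos_a - pos_b) ** 2


def analytic_grad_rel_a(rel_a, rel_b, cell_a, cell_b, shift, cell_size=8):
    """Derivative of the loss with respect to the regressed offset rel_a."""
    pos_a = (cell_a + rel_a) * cell_size + shift
    pos_b = (cell_b + rel_b) * cell_size
    return 2.0 * cell_size * (pos_a - pos_b)


def numeric_grad_rel_a(rel_a, rel_b, cell_a, cell_b, shift, step=1e-6):
    """Central finite difference, used only to cross-check the analytic form."""
    upper = point_distance_loss(rel_a + step, rel_b, cell_a, cell_b, shift)
    lower = point_distance_loss(rel_a - step, rel_b, cell_a, cell_b, shift)
    return (upper - lower) / (2.0 * step)


args = (0.25, 0.5, 2, 3, 6.0)
print(analytic_grad_rel_a(*args), numeric_grad_rel_a(*args))
```

The two values agree up to floating-point error, so an optimizer can move the offset in the direction that brings the two points together.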
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces UnsuperPoint, an end-to-end unsupervised interest point detector and descriptor. It employs a siamese network trained with novel losses on point scores, regressed positions (to enable end-to-end differentiability and incorporate NMS), and a uniform distribution regularizer. The approach requires no pseudo-ground-truth generation or SfM, uses only a single training round, runs at real-time speeds (323 fps at 224×320), and claims performance comparable or superior to prior methods on repeatability, localization error, matching score, and homography estimation on the HPatches dataset.
Significance. If the empirical claims hold under rigorous validation, the work would be significant for demonstrating a fully unsupervised, single-stage pipeline that avoids common dependencies on pseudo-labels or multi-view geometry, thereby simplifying training of repeatable detectors and descriptors while maintaining practical speed.
major comments (3)
- [§3] §3 (loss functions): the uniform-distribution term prevents density collapse but supplies no explicit mechanism to enforce repeatability or viewpoint invariance; the siamese consistency objective alone may be satisfied by low-level image artifacts that happen to be stable in the training distribution rather than by semantically useful points. This assumption is load-bearing for the central claim that the losses suffice without external supervision.
- [§3.2] §3.2 (position regression): folding NMS into the model via learned offsets assumes the regressed positions remain accurate after selection, yet the self-supervised objective provides no direct supervision or penalty on post-NMS localization error; this is load-bearing for the end-to-end training claim.
- [§4] §4 (experiments): the abstract asserts competitive or superior performance on HPatches but the provided text supplies no quantitative tables, error bars, dataset splits, or direct numerical comparisons to baselines such as SuperPoint; without these, the strength of the empirical support cannot be assessed.
minor comments (2)
- [Abstract] Abstract: key numerical results (repeatability scores, matching scores, etc.) should be included to allow readers to evaluate the performance claims immediately.
- [§3] Notation: the precise formulation of the three loss terms and their weighting hyperparameters should be stated explicitly with equation numbers for reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive comments. We respond point-by-point to the major concerns below.
- Referee: [§3] §3 (loss functions): the uniform-distribution term prevents density collapse but supplies no explicit mechanism to enforce repeatability or viewpoint invariance; the siamese consistency objective alone may be satisfied by low-level image artifacts that happen to be stable in the training distribution rather than by semantically useful points. This assumption is load-bearing for the central claim that the losses suffice without external supervision.
  Authors: The siamese consistency loss is computed on image pairs related by random homographies that simulate viewpoint variation. This forces both point scores and descriptors to be consistent under geometric change, which goes beyond stable low-level artifacts. The uniform regularizer further discourages collapse to trivial solutions. Empirical results on HPatches (repeatability, matching score, homography estimation) support that the learned points are useful rather than artifact-driven.
  Revision: no
- Referee: [§3.2] §3.2 (position regression): folding NMS into the model via learned offsets assumes the regressed positions remain accurate after selection, yet the self-supervised objective provides no direct supervision or penalty on post-NMS localization error; this is load-bearing for the end-to-end training claim.
  Authors: The regression head produces offsets that are applied before the consistency loss is evaluated, so gradients flow through the post-NMS positions. The siamese loss therefore directly penalizes inconsistency of the final selected points. We will add a clarifying sentence on this gradient path in the revision.
  Revision: partial
- Referee: [§4] §4 (experiments): the abstract asserts competitive or superior performance on HPatches but the provided text supplies no quantitative tables, error bars, dataset splits, or direct numerical comparisons to baselines such as SuperPoint; without these, the strength of the empirical support cannot be assessed.
  Authors: Section 4 of the manuscript contains the requested tables with numerical comparisons on repeatability, localization error, matching score and homography estimation versus SuperPoint and other baselines. We will ensure the tables, any error statistics, and dataset details are clearly visible in the revised version.
  Revision: yes
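The first response above rests on training pairs related by random homographies. A minimal sketch of generating one such transform follows. The decomposition into rotation, scale, translation, and a small perspective term, as well as every parameter range, is an illustrative assumption rather than the authors' settings, and the function names are hypothetical.

```python
import math
import random


def matmul3(a, b):
    """Multiply two 3x3 matrices stored as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(3)) for j in range(3)]
            for i in range(3)]


def random_homography(rng, max_angle=0.5, max_scale=0.2,
                      max_shift=20.0, max_persp=1e-4):
    """Sample a homography as a similarity transform times a perspective warp."""
    angle = rng.uniform(-max_angle, max_angle)          # radians
    scale = 1.0 + rng.uniform(-max_scale, max_scale)
    tx = rng.uniform(-max_shift, max_shift)             # pixels
    ty = rng.uniform(-max_shift, max_shift)
    px = rng.uniform(-max_persp, max_persp)             # perspective terms
    py = rng.uniform(-max_persp, max_persp)
    cos_a, sin_a = math.cos(angle), math.sin(angle)
    similarity = [[scale * cos_a, -scale * sin_a, tx],
                  [scale * sin_a, scale * cos_a, ty],
                  [0.0, 0.0, 1.0]]
    perspective = [[1.0, 0.0, 0.0],
                   [0.0, 1.0, 0.0],
                   [px, py, 1.0]]
    return matmul3(similarity, perspective)


rng = random.Random(0)  # seeded so the sampled transform is reproducible
for row in random_homography(rng):
    print(row)
```

In a siamese setup, one branch would see the original image and the other the image warped by the sampled transform, so the known matrix supplies point correspondences without any labels.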
Circularity Check
No significant circularity; derivation relies on external benchmarks
The paper describes a self-supervised siamese training procedure with custom losses on scores, positions, and uniformity. These losses are explicitly constructed to encourage repeatability and non-collapse, but the central performance claims (repeatability, localization, matching score, homography estimation) are measured on the external HPatches dataset after training. No quoted equations, self-citations, or derivation steps in the abstract reduce the reported results to the loss terms by construction; the method is validated against independent test data rather than being tautological with its training objective. This satisfies the criterion for a self-contained derivation against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: A siamese network trained with the described novel loss functions can learn consistent interest point scores and positions from unlabeled images alone.