UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor

Henrik Karstoft; Mikkel Fly Kragh; Peter Hviid Christiansen; Yury Brodskiy

arxiv: 1907.04011 · v1 · pith:ZP3OYAFFnew · submitted 2019-07-09 · 💻 cs.CV


Pith reviewed 2026-05-25 00:49 UTC · model grok-4.3

classification 💻 cs.CV

keywords interest point detection · unsupervised learning · siamese network · feature descriptor · self-supervised learning · computer vision · deep learning · homography estimation

The pith

UnsuperPoint trains an interest point detector and descriptor unsupervised via a siamese network and novel loss functions without pseudo-labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Interest points lack consistent human annotations, so most detectors depend on generated pseudo ground truth or structure-from-motion data. This work introduces UnsuperPoint, which learns both detector and descriptor end-to-end from a self-supervised siamese network using losses on point scores, positions, and prediction uniformity. The approach requires only a single training round and no external representations. It runs in real time while matching or exceeding prior methods on repeatability, localization, matching score, and homography tasks in the HPatches benchmark.

Core claim

UnsuperPoint is an unsupervised deep learning interest point detector and descriptor learned through self-supervised siamese training with a novel loss function that automatically learns point scores and positions via regression, incorporates non-maximum suppression inside the model, and regularizes predictions to be uniformly distributed, all without generating pseudo ground truth points, using structure-from-motion representations, or performing multiple training rounds.

What carries the argument

Self-supervised siamese network with regression of point positions to enable end-to-end training and non-maximum suppression, plus a uniformity loss that regularizes network predictions.
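As a rough illustration of position regression, the sketch below assumes the network emits one relative offset in [0, 1] per cell of a coarse grid and maps it back to pixel coordinates. The cell size of 8 pixels, the function name `cell_to_image_coords`, and the nested-list layout are assumptions made for illustration; this page does not confirm them. If each cell yields at most one point, duplicate detections inside a cell cannot occur, which is one possible reading of the claim that non-maximum suppression lives inside the model.

```python
def cell_to_image_coords(rel_positions, cell_size=8):
    """Map per-cell relative offsets in [0, 1] to image pixel coordinates.

    rel_positions[r][c] is an (x_rel, y_rel) pair for the cell in row r, column c.
    cell_size is a hypothetical downsampling factor, not taken from the source.
    """
    points = []
    for r, row in enumerate(rel_positions):
        for c, (x_rel, y_rel) in enumerate(row):
            # The cell index gives the coarse location, the offset refines it.
            x = (c + x_rel) * cell_size
            y = (r + y_rel) * cell_size
            points.append((x, y))
    return points


# A 2x2 grid of cells covers a 16x16 image when cell_size is 8.
grid = [
    [(0.5, 0.5), (0.0, 1.0)],
    [(0.25, 0.75), (1.0, 0.0)],
]
points = cell_to_image_coords(grid)
```

Because the mapping is a plain affine function of the predicted offsets, gradients can pass from a loss on pixel positions back to the regression output, which is consistent with the end-to-end framing above.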

If this is right

The detector and descriptor become end-to-end trainable without separate stages for pseudo-label creation.
Non-maximum suppression is handled inside the learned model rather than as a post-process.
The model achieves 323 fps at 224x320 resolution and 90 fps at 480x640 while remaining competitive on HPatches metrics.
Only one round of training is needed, avoiding iterative pseudo-ground-truth pipelines.
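To put the reported frame rates in perspective, the short calculation below converts them into per-frame latency and pixel throughput. It uses only the two speed figures quoted above; the helper name `frame_stats` is made up for this example.

```python
def frame_stats(height, width, fps):
    """Return (latency in ms per frame, pixels processed per second)."""
    latency_ms = 1000.0 / fps
    pixels_per_second = height * width * fps
    return latency_ms, pixels_per_second


# Reported speeds: (height, width, frames per second), as quoted above.
reported = [(224, 320, 323), (480, 640, 90)]

for height, width, fps in reported:
    latency_ms, pixels_per_second = frame_stats(height, width, fps)
    print(f"{height}x{width}: {latency_ms:.2f} ms/frame, "
          f"{pixels_per_second / 1e6:.1f} Mpx/s")
```

The two settings work out to roughly 3.1 ms and 11.1 ms per frame, with pixel throughput of the same order (about 23 and 28 megapixels per second), so the slowdown at the higher resolution is broadly in line with the larger pixel count.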

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The uniformity loss could be adapted to other unsupervised feature tasks to prevent degenerate clustering of predictions.
Single-pass training lowers the barrier to experimenting with new detector architectures compared with multi-stage label-refinement methods.
Real-time operation supports direct deployment inside larger vision pipelines such as visual odometry on embedded hardware.

Load-bearing premise

The siamese self-supervision together with the new losses on scores, positions, and uniformity will produce repeatable and useful interest points without any external labels or pseudo-data.

What would settle it

Train the network once on the described losses and evaluate repeatability and matching score on HPatches image pairs; if performance remains substantially below supervised baselines and does not improve with the uniformity term, the unsupervised claim would not hold.

Figures

Figures reproduced from arXiv: 1907.04011 by Henrik Karstoft, Mikkel Fly Kragh, Peter Hviid Christiansen, Yury Brodskiy.

**Figure 1.** UnsuperPoint takes an input image and outputs an interest point vector. The score, position and descriptor ...

**Figure 2.** The network predicts a single interest point position for each ...

**Figure 3.** Two permutations of the same image are forwarded through a siamese network. Corresponding points ...

**Figure 4.** Scores and the distance between point-pairs

**Figure 5.** Histogram of x-coordinate position predictions (a) Histogram of distributions (b) Ascendingly sorted values

**Figure 6.** Uniform, Gaussian and clipped Gaussian distribution centered around 0.5 in the range 0-1.

**Figure 7.** Image matching for homography estimation. The detector generates point positions and descriptors for two ...

**Figure 8.** Homography error (HE) is the mean distance between corners of the target image after being transformed ...

**Figure 9.** Interest point prediction on reference and target frame for small image motion. Predictions from the refer...

**Figure 10.** Filtered matches from UnsuperPoint for small and large motion examples. Matches are represented with ...
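The Figure 8 caption describes homography error as a mean distance between transformed image corners. The caption is cut off on this page, so the sketch below follows one common formulation as an assumption: the four image corners are mapped by a reference homography and by an estimated one, and the Euclidean distances are averaged. The helper names and the example matrices are hypothetical.

```python
import math


def apply_homography(H, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (xh / w, yh / w)


def homography_error(H_true, H_est, width, height):
    """Mean corner distance between two homographies (assumed formulation)."""
    corners = [(0, 0), (width - 1, 0), (width - 1, height - 1), (0, height - 1)]
    distances = []
    for corner in corners:
        tx, ty = apply_homography(H_true, corner)
        ex, ey = apply_homography(H_est, corner)
        distances.append(math.hypot(tx - ex, ty - ey))
    return sum(distances) / len(distances)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shifted = [[1, 0, 3], [0, 1, 4], [0, 0, 1]]  # pure translation by (3, 4)
error = homography_error(identity, shifted, width=320, height=240)
```

For a pure translation every corner moves by the same amount, so the error here equals the length of the shift (5 pixels). An estimate is then typically counted as correct when this value falls below a chosen pixel tolerance.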

Original abstract

It is hard to create consistent ground truth data for interest points in natural images, since interest points are hard to define clearly and consistently for a human annotator. This makes interest point detectors non-trivial to build. In this work, we introduce an unsupervised deep learning-based interest point detector and descriptor. Using a self-supervised approach, we utilize a siamese network and a novel loss function that enables interest point scores and positions to be learned automatically. The resulting interest point detector and descriptor is UnsuperPoint. We use regression of point positions to 1) make UnsuperPoint end-to-end trainable and 2) to incorporate non-maximum suppression in the model. Unlike most trainable detectors, it requires no generation of pseudo ground truth points, no structure-from-motion-generated representations and the model is learned from only one round of training. Furthermore, we introduce a novel loss function to regularize network predictions to be uniformly distributed. UnsuperPoint runs in real-time with 323 frames per second (fps) at a resolution of $224\times320$ and 90 fps at $480\times640$. It is comparable or better than state-of-the-art performance when measured for speed, repeatability, localization, matching score and homography estimation on the HPatch dataset.
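The abstract does not spell out the uniformity regularizer. One plausible reading, suggested by the "ascendingly sorted values" panel in Figure 5, is to sort the predicted values and penalize their deviation from an evenly spaced ramp on [0, 1]. The sketch below implements that reading; the function name `uniformity_loss` and the exact squared-difference form are assumptions, not the paper's confirmed formula.

```python
def uniformity_loss(values):
    """Squared deviation of sorted values from an evenly spaced ramp on [0, 1]."""
    m = len(values)
    if m < 2:
        return 0.0
    ordered = sorted(values)
    # Ideal uniform samples, sorted, fall on 0, 1/(m-1), ..., 1.
    targets = [i / (m - 1) for i in range(m)]
    return sum((v - t) ** 2 for v, t in zip(ordered, targets))


spread = [0.0, 0.25, 0.5, 0.75, 1.0]      # already evenly spread
clustered = [0.5, 0.5, 0.5, 0.5, 0.5]     # collapsed onto one value

loss_spread = uniformity_loss(spread)
loss_clustered = uniformity_loss(clustered)
```

Evenly spread predictions give zero loss, while predictions collapsed onto a single value are penalized (0.625 in this example). In an autodiff framework the same idea would use a differentiable sort so that gradients reach the underlying predictions.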

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

UnsuperPoint delivers a clean self-supervised siamese detector with a uniform loss and position regression for NMS, but the repeatability claims rest on losses that may not enforce cross-view stability.

The core contribution is a fully unsupervised interest point detector and descriptor trained end-to-end on siamese pairs. It avoids pseudo-ground-truth generation and SfM by regressing point positions (which folds NMS into the forward pass) and adds a loss that pushes predictions toward uniform distribution. That combination is the actual novelty, and the reported 323 fps at 224x320 with competitive HPatches numbers on repeatability, localization, and homography is the practical payoff. The single-round training claim also removes one common source of pipeline complexity. Those pieces are concrete and worth looking at if you need a fast, label-free baseline.

The soft spot is exactly the one flagged in the stress-test note. The uniform loss stops density collapse, but nothing in the objective directly penalizes points that latch onto low-level artifacts that happen to be consistent within the training pairs yet drift under real viewpoint change. Without external verification or an ablation that isolates the position regression after NMS, it is not obvious the surviving points are stable rather than merely repeatable on the training distribution. The paper shows the method works on HPatches, but the gap between "no pseudo-labels" and "truly unsupervised invariance" is still bridged by the loss design itself.

This is the kind of paper that belongs in a reading group focused on self-supervised feature learning. A serious editor should send it to review; the architectural choices and speed results are solid enough to justify referee time even if the experiments need tightening on the repeatability mechanism.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces UnsuperPoint, an end-to-end unsupervised interest point detector and descriptor. It employs a siamese network trained with novel losses on point scores, regressed positions (to enable end-to-end differentiability and incorporate NMS), and a uniform distribution regularizer. The approach requires no pseudo-ground-truth generation or SfM, uses only a single training round, runs at real-time speeds (323 fps at 224×320), and claims performance comparable or superior to prior methods on repeatability, localization error, matching score, and homography estimation on the HPatches dataset.

Significance. If the empirical claims hold under rigorous validation, the work would be significant for demonstrating a fully unsupervised, single-stage pipeline that avoids common dependencies on pseudo-labels or multi-view geometry, thereby simplifying training of repeatable detectors and descriptors while maintaining practical speed.

major comments (3)

§3 (loss functions): the uniform-distribution term prevents density collapse but supplies no explicit mechanism to enforce repeatability or viewpoint invariance; the siamese consistency objective alone may be satisfied by low-level image artifacts that happen to be stable in the training distribution rather than by semantically useful points. This assumption is load-bearing for the central claim that the losses suffice without external supervision.
§3.2 (position regression): folding NMS into the model via learned offsets assumes the regressed positions remain accurate after selection, yet the self-supervised objective provides no direct supervision or penalty on post-NMS localization error; this is load-bearing for the end-to-end training claim.
§4 (experiments): the abstract asserts competitive or superior performance on HPatches but the provided text supplies no quantitative tables, error bars, dataset splits, or direct numerical comparisons to baselines such as SuperPoint; without these, the strength of the empirical support cannot be assessed.

minor comments (2)

Abstract: key numerical results (repeatability scores, matching scores, etc.) should be included to allow readers to evaluate the performance claims immediately.
§3 (notation): the precise formulation of the three loss terms and their weighting hyperparameters should be stated explicitly with equation numbers for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major concerns below.

Referee: §3 (loss functions): the uniform-distribution term prevents density collapse but supplies no explicit mechanism to enforce repeatability or viewpoint invariance; the siamese consistency objective alone may be satisfied by low-level image artifacts that happen to be stable in the training distribution rather than by semantically useful points. This assumption is load-bearing for the central claim that the losses suffice without external supervision.

Authors: The siamese consistency loss is computed on image pairs related by random homographies that simulate viewpoint variation. This forces both point scores and descriptors to be consistent under geometric change, which goes beyond stable low-level artifacts. The uniform regularizer further discourages collapse to trivial solutions. Empirical results on HPatches (repeatability, matching score, homography estimation) support that the learned points are useful rather than artifact-driven.

Revision: no

Referee: §3.2 (position regression): folding NMS into the model via learned offsets assumes the regressed positions remain accurate after selection, yet the self-supervised objective provides no direct supervision or penalty on post-NMS localization error; this is load-bearing for the end-to-end training claim.

Authors: The regression head produces offsets that are applied before the consistency loss is evaluated, so gradients flow through the post-NMS positions. The siamese loss therefore directly penalizes inconsistency of the final selected points. We will add a clarifying sentence on this gradient path in the revision.

Revision: partial

Referee: §4 (experiments): the abstract asserts competitive or superior performance on HPatches but the provided text supplies no quantitative tables, error bars, dataset splits, or direct numerical comparisons to baselines such as SuperPoint; without these, the strength of the empirical support cannot be assessed.

Authors: Section 4 of the manuscript contains the requested tables with numerical comparisons on repeatability, localization error, matching score and homography estimation versus SuperPoint and other baselines. We will ensure the tables, any error statistics, and dataset details are clearly visible in the revised version.

Revision: yes
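The rebuttal's description of a consistency loss on image pairs related by a known homography can be sketched as follows: points from one branch are warped into the other image, each is paired with its nearest neighbour there, and pairs that are too far apart are discarded. The distance threshold, the function names, and the use of mean pair distance as a stand-in for a position loss are all assumptions for illustration, not the paper's confirmed procedure.

```python
import math


def warp_point(H, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def point_pairs(points_a, points_b, H, max_distance=4.0):
    """Pair each warped A point with its nearest B point if close enough."""
    if not points_b:
        return []
    pairs = []
    for i, point in enumerate(points_a):
        wx, wy = warp_point(H, point)
        distances = [math.hypot(wx - bx, wy - by) for bx, by in points_b]
        j = min(range(len(distances)), key=distances.__getitem__)
        if distances[j] <= max_distance:   # hypothetical threshold in pixels
            pairs.append((i, j, distances[j]))
    return pairs


def mean_pair_distance(pairs):
    """Average distance over matched pairs (stand-in for a position loss)."""
    return sum(d for _, _, d in pairs) / len(pairs) if pairs else 0.0


H = [[1, 0, 10], [0, 1, 0], [0, 0, 1]]       # translate by 10 px in x
points_a = [(5.0, 5.0), (20.0, 8.0), (40.0, 40.0)]
points_b = [(15.0, 6.0), (30.0, 8.0), (90.0, 90.0)]
pairs = point_pairs(points_a, points_b, H)
```

In this toy case the first two points find partners at distances 1 and 0, while the third has no neighbour within the threshold and contributes nothing. That dropout is the crux of the referee's concern: only points that already land near a counterpart receive a consistency signal.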

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external benchmarks

The paper describes a self-supervised siamese training procedure with custom losses on scores, positions, and uniformity. These losses are explicitly constructed to encourage repeatability and non-collapse, but the central performance claims (repeatability, localization, matching score, homography estimation) are measured on the external HPatches dataset after training. No quoted equations, self-citations, or derivation steps in the abstract reduce the reported results to the loss terms by construction; the method is validated against independent test data rather than being tautological with its training objective. This satisfies the criterion for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review yields minimal concrete parameters or entities; the central assumption is the effectiveness of the self-supervised loss.

axioms (1)

Domain assumption: A siamese network trained with the described novel loss functions can learn consistent interest point scores and positions from unlabeled images alone. This is the core premise of the self-supervised training approach.


[1] [1]

Deep learning

Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, May 2015

work page 2015

[2] [2]

(AlexNet) im- agenet classiﬁcation with deep convolutional neural net- works

A Krizhevsky, I Sutskever, and Ge Hinton. (AlexNet) im- agenet classiﬁcation with deep convolutional neural net- works. Adv. Neural Inf. Process. Syst. , pages 1097–1105, 2012

work page 2012

[3] [3]

Delving deep into rectiﬁers: Surpassing human-level per- formance on imagenet classiﬁcation

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Delving deep into rectiﬁers: Surpassing human-level per- formance on imagenet classiﬁcation. InProceedings of the IEEE international conference on computer vision , pages 1026–1034, 2015

work page 2015

[4] [4]

Dermatologist-level classiﬁcation of skin cancer with deep neural networks

Andre Esteva, Brett Kuprel, Roberto A Novoa, Justin Ko, Susan M Swetter, Helen M Blau, and Sebastian Thrun. Dermatologist-level classiﬁcation of skin cancer with deep neural networks. Nature, 542(7639):115–118, February 2017

work page 2017

[5] [5]

Lip reading sentences in the wild

Joon Son Chung, Andrew Senior, Oriol Vinyals, and Andrew Zisserman. Lip reading sentences in the wild. November 2016

work page 2016

[6] [6]

Cardiologist- Level arrhythmia detection with convolutional neural net- works

Pranav Rajpurkar, Awni Y Hannun, Masoumeh Hagh- panahi, Codie Bourn, and Andrew Y Ng. Cardiologist- Level arrhythmia detection with convolutional neural net- works. July 2017

work page 2017

[7] [7]

Multiple View Geometry in Computer Vision

Richard Hartley and Andrew Zisserman. Multiple View Geometry in Computer Vision . Cambridge University Press, 2003

work page 2003

[8] [8]

Posenet: A convolutional network for real-time 6-dof cam- era relocalization

Alex Kendall, Matthew Grimes, and Roberto Cipolla. Posenet: A convolutional network for real-time 6-dof cam- era relocalization. In Computer Vision (ICCV), 2015 IEEE International Conference on, pages 2938–2946, 2015

work page 2015

[9] [9]

Geometric loss func- tions for camera pose regression with deep learning

Alex Kendall and Roberto Cipolla. Geometric loss func- tions for camera pose regression with deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5974–5983, 2017

work page 2017

[10] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Deep image homography estimation. June 2016

[11] Wenjie Luo, Alexander G Schwing, and Raquel Urtasun. Efficient deep learning for stereo matching, 2016

[12] S Wang, R Clark, H Wen, and N Trigoni. DeepVO: Towards end-to-end visual odometry with deep recurrent convolutional neural networks. In 2017 IEEE International Conference on Robotics and Automation (ICRA), pages 2043–2050, May 2017

[13] Clément Godard, Oisin Mac Aodha, and Gabriel J Brostow. Unsupervised monocular depth estimation with left-right consistency. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 270–279, 2017

[14] S A K Tareen and Z Saleem. A comparative analysis of SIFT, SURF, KAZE, AKAZE, ORB, and BRISK. In 2018 International Conference on Computing, Mathematics and Engineering Technologies (iCoMET), pages 1–10, March 2018

[15] David G Lowe. Distinctive image features from Scale-Invariant keypoints, 2004

[16] Herbert Bay, Andreas Ess, Tinne Tuytelaars, and Luc Van Gool. Speeded-Up robust features (SURF). Comput. Vis. Image Underst., 110(3):346–359, June 2008

[17] Ethan Rublee, Vincent Rabaud, Kurt Konolige, and Gary R Bradski. ORB: An efficient alternative to SIFT or SURF. In ICCV, volume 11, page 2, 2011

[18] Pablo Alcantarilla, Jesus Nuevo, and Adrien Bartoli. Fast explicit diffusion for accelerated features in nonlinear scale spaces, 2013

[19] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. BRISK: Binary robust invariant scalable keypoints. In 2011 IEEE international conference on computer vision (ICCV), pages 2548–2555, 2011

[20] Bill Triggs, Philip F McLauchlan, Richard I Hartley, and Andrew W Fitzgibbon. Bundle adjustment—a modern synthesis. In International workshop on vision algorithms, pages 298–372, 1999

[21] R Mur-Artal, J M M Montiel, and J D Tardós. ORB-SLAM: A versatile and accurate monocular SLAM system. IEEE Trans. Rob., 31(5):1147–1163, October 2015

[22] R Mur-Artal and J D Tardós. ORB-SLAM2: An Open-Source SLAM system for monocular, stereo, and RGB-D cameras. IEEE Trans. Rob., 33(5):1255–1262, October 2017

[23] Raúl Mur-Artal and Juan D Tardós. Fast relocalisation and loop closing in keyframe-based SLAM. In 2014 IEEE International Conference on Robotics and Automation (ICRA), pages 846–853, 2014

[24] Maxime Lhuillier. Incremental fusion of Structure-from-Motion and GPS using constrained bundle adjustments. IEEE Trans. Pattern Anal. Mach. Intell., 34(12):2489–2495, December 2012

[25] Stefan Leutenegger, Paul Furgale, Vincent Rabaud, Margarita Chli, Kurt Konolige, and Roland Siegwart. Keyframe-based visual-inertial slam using nonlinear optimization. Proceedings of Robotics Science and Systems (RSS) 2013, 2013

[26] T Qin, P Li, and S Shen. VINS-Mono: A robust and versatile monocular Visual-Inertial state estimator. IEEE Trans. Rob., 34(4):1004–1020, August 2018

[27] X Han, T Leung, Y Jia, R Sukthankar, and others. Matchnet: Unifying feature and metric learning for patch-based matching. Proc. IEEE, 2015

[28] Matthew Brown, Gang Hua, and Simon Winder. Discriminative learning of local image descriptors. IEEE Trans. Pattern Anal. Mach. Intell., 33(1):43–57, January 2011

[29] Edgar Simo-Serra, Eduard Trulls, Luis Ferraz, Iasonas Kokkinos, Pascal Fua, and Francesc Moreno-Noguer. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision, pages 118–126, 2015

[30] Sergey Zagoruyko and Nikos Komodakis. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 4353–4361, 2015

[31] Vassileios Balntas, Edward Johns, Lilian Tang, and Krystian Mikolajczyk. PN-Net: Conjoined triple deep network for learning local image descriptors. January 2016

[32] Yurun Tian, Bin Fan, and Fuchao Wu. L2-net: Deep learning of discriminative patch descriptor in euclidean space. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 661–669, 2017

[33] Yannick Verdie, Kwang Moo Yi, Pascal Fua, and Vincent Lepetit. TILDE: A temporally invariant learned DEtector. November 2014

[34] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1822–1830, 2017

[35] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. LIFT: Learned invariant feature transform. In Computer Vision – ECCV 2016, Lecture Notes in Computer Science, pages 467–483. Springer, Cham, October 2016

[36] Max Jaderberg, Karen Simonyan, Andrew Zisserman, and Koray Kavukcuoglu. Spatial transformer networks. June 2015

[37] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. LF-Net: Learning local features from images. In S Bengio, H Wallach, H Larochelle, K Grauman, N Cesa-Bianchi, and R Garnett, editors, Advances in Neural Information Processing Systems 31, pages 6234–6244. Curran Associates, Inc., 2018

[38] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018

[39] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. CoRR, abs/1502.03167, 2015

[40] Oleksandr Bailo, Francois Rameau, Kyungdon Joo, Jinsun Park, Oleksandr Bogdan, and In So Kweon. Efficient adaptive non-maximal suppression algorithms for homogeneous spatial keypoint distribution. Pattern Recognit. Lett., 106:53–60, April 2018

[41] Adam Paszke, Sam Gross, Soumith Chintala, and Gregory Chanan. Pytorch. Computer software. Vers. 0.3, 1, 2017

[42] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft COCO: Common objects in context. In Computer Vision – ECCV 2014, pages 740–. Springer International Publishing, 2014

[44] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. December 2014

[45] Cordelia Schmid, Roger Mohr, and Christian Bauckhage. Evaluation of interest point detectors. Int. J. Comput. Vis., 37(2):151–172, June 2000

[46] Martin A Fischler and Robert C Bolles. Random sample consensus: A paradigm for model fitting with applications to image analysis and automated cartography. Commun. ACM, 24(6):381–395, June 1981

[47] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017

[48] Daniel DeTone. Superpoint. https://github.com/MagicLeapResearch/SuperPointPretrainedNetwork

[49] Yuki Ono. Lf-net. https://github.com/vcg-uvic/lf-net-release

[50] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Identity mappings in deep residual networks. In Computer Vision – ECCV 2016, pages 630–645. Springer, Cham, October 2016

[51] Gao Huang, Zhuang Liu, Kilian Q Weinberger, and Laurens van der Maaten. Densely connected convolutional networks. August 2016

[52] Jie Hu, Li Shen, Samuel Albanie, Gang Sun, and Enhua Wu. Squeeze-and-Excitation networks. September 2017

[53] François Chollet. Xception: Deep learning with depthwise separable convolutions. October 2016

[54] Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. MobileNets: Efficient convolutional neural networks for mobile vision applications. April 2017

[55] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation, 2015