
arxiv: 1907.04011 · v1 · pith:ZP3OYAFFnew · submitted 2019-07-09 · 💻 cs.CV

UnsuperPoint: End-to-end Unsupervised Interest Point Detector and Descriptor

Pith reviewed 2026-05-25 00:49 UTC · model grok-4.3

classification 💻 cs.CV
keywords interest point detection · unsupervised learning · siamese network · feature descriptor · self-supervised learning · computer vision · deep learning · homography estimation

The pith

UnsuperPoint trains an interest point detector and descriptor unsupervised via a siamese network and novel loss functions without pseudo-labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Interest points lack consistent human annotations, so most detectors depend on generated pseudo ground truth or structure-from-motion data. This work introduces UnsuperPoint, which learns both detector and descriptor end-to-end from a self-supervised siamese network using losses on point scores, positions, and prediction uniformity. The approach requires only a single training round and no external representations. It runs in real time while matching or exceeding prior methods on repeatability, localization, matching score, and homography tasks in the HPatches benchmark.

Core claim

UnsuperPoint is an unsupervised deep learning interest point detector and descriptor learned through self-supervised siamese training with a novel loss function that automatically learns point scores and positions via regression, incorporates non-maximum suppression inside the model, and regularizes predictions to be uniformly distributed, all without generating pseudo ground truth points, using structure-from-motion representations, or performing multiple training rounds.

What carries the argument

Self-supervised siamese network with regression of point positions to enable end-to-end training and non-maximum suppression, plus a uniformity loss that regularizes network predictions.
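
The uniformity regularizer can be pictured with a small sketch. It assumes, going by the "ascendingly sorted values" wording in the Figure 5 caption, that predictions in the range 0-1 are sorted and compared against evenly spaced targets. The function name `uniformity_loss` and the squared-error form are illustrative assumptions, not the paper's stated equation.

```python
def uniformity_loss(values):
    """Sum of squared gaps between sorted values and an even grid on [0, 1].

    Assumption: `values` are relative coordinates that should lie in [0, 1].
    """
    n = len(values)
    if n < 2:
        return 0.0
    ordered = sorted(values)
    # The i-th smallest value of a perfectly even sample sits at i / (n - 1).
    return sum((v - i / (n - 1)) ** 2 for i, v in enumerate(ordered))


even = [0.0, 0.25, 0.5, 0.75, 1.0]
clustered = [0.48, 0.49, 0.5, 0.51, 0.52]
print(uniformity_loss(even))       # evenly spread values incur no penalty
print(uniformity_loss(clustered))  # values bunched around 0.5 are penalized
```

Sorting turns a comparison between distributions into a comparison between two ordered lists, so the penalty grows as predictions bunch together.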

If this is right

  • The detector and descriptor become end-to-end trainable without separate stages for pseudo-label creation.
  • Non-maximum suppression is handled inside the learned model rather than as a post-process.
  • The model achieves 323 fps at 224x320 resolution and 90 fps at 480x640 while remaining competitive on HPatches metrics.
  • Only one round of training is needed, avoiding iterative pseudo-ground-truth pipelines.
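
As a quick check on what the reported frame rates mean in practice, the snippet below converts them to an average per-frame time budget. Only the two fps figures come from the text; the helper name `frame_time_ms` is made up for illustration.

```python
def frame_time_ms(fps):
    """Average time per frame in milliseconds for a given throughput."""
    return 1000.0 / fps


# Frame rates as reported for the two input resolutions.
for resolution, fps in (("224x320", 323), ("480x640", 90)):
    print(f"{resolution}: {frame_time_ms(fps):.2f} ms per frame")
```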

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The uniformity loss could be adapted to other unsupervised feature tasks to prevent degenerate clustering of predictions.
  • Single-pass training lowers the barrier to experimenting with new detector architectures compared with multi-stage label-refinement methods.
  • Real-time operation supports direct deployment inside larger vision pipelines such as visual odometry on embedded hardware.

Load-bearing premise

The siamese self-supervision together with the new losses on scores, positions, and uniformity will produce repeatable and useful interest points without any external labels or pseudo-data.

What would settle it

Train the network once on the described losses and evaluate repeatability and matching score on HPatches image pairs; if performance remains substantially below supervised baselines and does not improve with the uniformity term, the unsupervised claim would not hold.
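
A repeatability check of this kind could be sketched as below. The sketch assumes a commonly used definition: a point counts as repeated when, after warping with the known homography, a point from the other image lies within `eps` pixels. The threshold of 3 pixels and all function names are assumptions for illustration, not values taken from the paper.

```python
import math


def warp_point(H, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def repeatability(points_a, points_b, H_ab, eps=3.0):
    """Fraction of points with a counterpart within eps pixels (symmetric)."""
    if not points_a or not points_b:
        return 0.0
    warped_a = [warp_point(H_ab, p) for p in points_a]

    def count_close(sources, targets):
        return sum(
            1 for s in sources
            if min(math.dist(s, t) for t in targets) <= eps
        )

    hits = count_close(warped_a, points_b) + count_close(points_b, warped_a)
    return hits / (len(points_a) + len(points_b))


# Toy pair: image B is image A shifted 10 pixels to the right.
H_shift = [[1, 0, 10], [0, 1, 0], [0, 0, 1]]
pts_a = [(5, 5), (20, 30), (50, 50)]
pts_b = [(15, 5), (31, 30), (200, 200)]
print(repeatability(pts_a, pts_b, H_shift))
```

In the toy pair, two of the three points in each image find a partner, so the score is 4/6.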

Figures

Figures reproduced from arXiv: 1907.04011 by Henrik Karstoft, Mikkel Fly Kragh, Peter Hviid Christiansen, Yury Brodskiy.

Figure 1: UnsuperPoint takes an input image and outputs an interest point vector. The score, position and descriptor …
Figure 2: The network predicts a single interest point position for each …
Figure 3: Two permutations of the same image are forwarded through a siamese network. Corresponding points …
Figure 4: Scores and the distance between point-pairs …
Figure 5: Histogram of x-coordinate position predictions. (a) Histogram of distributions. (b) Ascendingly sorted values …
Figure 6: Uniform, Gaussian and clipped Gaussian distribution centered around 0.5 in the range 0-1.
Figure 7: Image matching for homography estimation. The detector generates point positions and descriptors for two …
Figure 8: Homography error (HE) is the mean distance between corners of the target image after being transformed …
Figure 9: Interest point prediction on reference and target frame for small image motion. Predictions from the refer…
Figure 10: Filtered matches from UnsuperPoint for small and large motion examples. Matches are represented with …
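
The homography error named in the Figure 8 caption can be illustrated with a short sketch. The caption is cut off here, so the sketch assumes the corners are transformed once by a ground-truth homography and once by an estimated one, and that the error is the mean distance between the two results. The function names and the corner convention (pixel indices from 0 to width - 1 and height - 1) are assumptions.

```python
import math


def warp_point(H, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def homography_error(H_true, H_est, width, height):
    """Mean distance between image corners warped by two homographies."""
    corners = [
        (0.0, 0.0),
        (width - 1.0, 0.0),
        (width - 1.0, height - 1.0),
        (0.0, height - 1.0),
    ]
    dists = [
        math.dist(warp_point(H_true, c), warp_point(H_est, c))
        for c in corners
    ]
    return sum(dists) / len(dists)


identity = [[1, 0, 0], [0, 1, 0], [0, 0, 1]]
shifted = [[1, 0, 3], [0, 1, 4], [0, 0, 1]]  # off by (3, 4) pixels everywhere
print(homography_error(identity, shifted, 640, 480))
```

With a pure translation error of (3, 4) pixels, every corner moves by 5 pixels, so the mean corner distance is 5.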

It is hard to create consistent ground truth data for interest points in natural images, since interest points are hard to define clearly and consistently for a human annotator. This makes interest point detectors non-trivial to build. In this work, we introduce an unsupervised deep learning-based interest point detector and descriptor. Using a self-supervised approach, we utilize a siamese network and a novel loss function that enables interest point scores and positions to be learned automatically. The resulting interest point detector and descriptor is UnsuperPoint. We use regression of point positions to 1) make UnsuperPoint end-to-end trainable and 2) to incorporate non-maximum suppression in the model. Unlike most trainable detectors, it requires no generation of pseudo ground truth points, no structure-from-motion-generated representations and the model is learned from only one round of training. Furthermore, we introduce a novel loss function to regularize network predictions to be uniformly distributed. UnsuperPoint runs in real-time with 323 frames per second (fps) at a resolution of $224\times320$ and 90 fps at $480\times640$. It is comparable or better than state-of-the-art performance when measured for speed, repeatability, localization, matching score and homography estimation on the HPatch dataset.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces UnsuperPoint, an end-to-end unsupervised interest point detector and descriptor. It employs a siamese network trained with novel losses on point scores, regressed positions (to enable end-to-end differentiability and incorporate NMS), and a uniform distribution regularizer. The approach requires no pseudo-ground-truth generation or SfM, uses only a single training round, runs at real-time speeds (323 fps at 224×320), and claims performance comparable or superior to prior methods on repeatability, localization error, matching score, and homography estimation on the HPatches dataset.

Significance. If the empirical claims hold under rigorous validation, the work would be significant for demonstrating a fully unsupervised, single-stage pipeline that avoids common dependencies on pseudo-labels or multi-view geometry, thereby simplifying training of repeatable detectors and descriptors while maintaining practical speed.

major comments (3)
  1. [§3] §3 (loss functions): the uniform-distribution term prevents density collapse but supplies no explicit mechanism to enforce repeatability or viewpoint invariance; the siamese consistency objective alone may be satisfied by low-level image artifacts that happen to be stable in the training distribution rather than by semantically useful points. This assumption is load-bearing for the central claim that the losses suffice without external supervision.
  2. [§3.2] §3.2 (position regression): folding NMS into the model via learned offsets assumes the regressed positions remain accurate after selection, yet the self-supervised objective provides no direct supervision or penalty on post-NMS localization error; this is load-bearing for the end-to-end training claim.
  3. [§4] §4 (experiments): the abstract asserts competitive or superior performance on HPatches but the provided text supplies no quantitative tables, error bars, dataset splits, or direct numerical comparisons to baselines such as SuperPoint; without these, the strength of the empirical support cannot be assessed.
minor comments (2)
  1. [Abstract] Abstract: key numerical results (repeatability scores, matching scores, etc.) should be included to allow readers to evaluate the performance claims immediately.
  2. [§3] Notation: the precise formulation of the three loss terms and their weighting hyperparameters should be stated explicitly with equation numbers for reproducibility.
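
The position-regression idea raised in major comment 2 can be made concrete with a small sketch. It assumes the image is divided into square cells, that the network emits one relative offset in [0, 1] per axis for each cell, and that a cell is 8 pixels wide. The cell size and both function names are assumptions for illustration.

```python
def cell_to_image_xy(row, col, rel_x, rel_y, cell=8):
    """Map a per-cell relative offset in [0, 1] to image pixel coordinates."""
    if not (0.0 <= rel_x <= 1.0 and 0.0 <= rel_y <= 1.0):
        raise ValueError("relative offsets must lie in [0, 1]")
    return ((col + rel_x) * cell, (row + rel_y) * cell)


def decode_grid(rel_x_map, rel_y_map, cell=8):
    """Decode one point per cell from two equally shaped 2-D offset maps."""
    return [
        cell_to_image_xy(r, c, rel_x_map[r][c], rel_y_map[r][c], cell)
        for r in range(len(rel_x_map))
        for c in range(len(rel_x_map[r]))
    ]


# A 2x2 grid of cells, i.e. a 16x16 pixel image under the assumed cell size.
rel_x = [[0.5, 0.25], [0.0, 1.0]]
rel_y = [[0.5, 0.75], [0.0, 1.0]]
print(decode_grid(rel_x, rel_y))
```

Because each cell yields exactly one point, detections cannot pile up inside a cell, which is one way to read the claim that regression builds non-maximum suppression into the model.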

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive comments. We respond point-by-point to the major concerns below.

  1. Referee: [§3] §3 (loss functions): the uniform-distribution term prevents density collapse but supplies no explicit mechanism to enforce repeatability or viewpoint invariance; the siamese consistency objective alone may be satisfied by low-level image artifacts that happen to be stable in the training distribution rather than by semantically useful points. This assumption is load-bearing for the central claim that the losses suffice without external supervision.

    Authors: The siamese consistency loss is computed on image pairs related by random homographies that simulate viewpoint variation. This forces both point scores and descriptors to be consistent under geometric change, which goes beyond stable low-level artifacts. The uniform regularizer further discourages collapse to trivial solutions. Empirical results on HPatches (repeatability, matching score, homography estimation) support that the learned points are useful rather than artifact-driven. revision: no

  2. Referee: [§3.2] §3.2 (position regression): folding NMS into the model via learned offsets assumes the regressed positions remain accurate after selection, yet the self-supervised objective provides no direct supervision or penalty on post-NMS localization error; this is load-bearing for the end-to-end training claim.

    Authors: The regression head produces offsets that are applied before the consistency loss is evaluated, so gradients flow through the post-NMS positions. The siamese loss therefore directly penalizes inconsistency of the final selected points. We will add a clarifying sentence on this gradient path in the revision. revision: partial

  3. Referee: [§4] §4 (experiments): the abstract asserts competitive or superior performance on HPatches but the provided text supplies no quantitative tables, error bars, dataset splits, or direct numerical comparisons to baselines such as SuperPoint; without these, the strength of the empirical support cannot be assessed.

    Authors: Section 4 of the manuscript contains the requested tables with numerical comparisons on repeatability, localization error, matching score and homography estimation versus SuperPoint and other baselines. We will ensure the tables, any error statistics, and dataset details are clearly visible in the revised version. revision: yes
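
The siamese consistency argument in the responses above can be sketched in code. The sketch assumes points from one branch are warped by the known homography, each is paired with its nearest point in the other branch, and only pairs closer than a threshold contribute. The threshold of 4 pixels, the mean-distance loss, and all names are illustrative assumptions; the exact loss terms are not given on this page.

```python
import math


def warp_point(H, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    u = H[0][0] * x + H[0][1] * y + H[0][2]
    v = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return (u / w, v / w)


def point_pairs(points_a, points_b, H_ab, max_dist=4.0):
    """Return (index_a, index_b, distance) for nearest pairs within max_dist."""
    pairs = []
    for i, p in enumerate(points_a):
        warped = warp_point(H_ab, p)
        j, d = min(
            ((j, math.dist(warped, q)) for j, q in enumerate(points_b)),
            key=lambda item: item[1],
            default=(None, math.inf),
        )
        if d <= max_dist:
            pairs.append((i, j, d))
    return pairs


def position_consistency_loss(pairs):
    """Mean distance over paired points; zero when no pairs exist."""
    return sum(d for _, _, d in pairs) / len(pairs) if pairs else 0.0


# Toy pair: branch B sees branch A scaled by a factor of 2.
H_scale = [[2, 0, 0], [0, 2, 0], [0, 0, 1]]
pts_a = [(10, 10), (20, 5), (40, 40)]
pts_b = [(20, 23), (41, 10), (200, 200)]
pairs = point_pairs(pts_a, pts_b, H_scale)
print(pairs, position_consistency_loss(pairs))
```

The third point in branch A has no nearby partner after warping, so it is left out of the loss rather than being forced into a bad match.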

Circularity Check

0 steps flagged

No significant circularity; derivation relies on external benchmarks


The paper describes a self-supervised siamese training procedure with custom losses on scores, positions, and uniformity. These losses are explicitly constructed to encourage repeatability and prevent collapse, but the central performance claims (repeatability, localization, matching score, homography estimation) are measured on the external HPatches dataset after training. No quoted equations, self-citations, or derivation steps in the abstract reduce the reported results to the loss terms by construction; the method is validated against independent test data rather than being tautological with its training objective. This satisfies the criterion for a self-contained derivation against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only review yields minimal concrete parameters or entities; the central assumption is the effectiveness of the self-supervised loss.

axioms (1)
  • domain assumption: A siamese network trained with the described novel loss functions can learn consistent interest point scores and positions from unlabeled images alone.
    This is the core premise of the self-supervised training approach.


