ELF: Embedded Localisation of Features in pre-trained CNN

Assia Benbihi; Cédric Pradalier; Matthieu Geist

arxiv: 1907.03261 · v1 · pith:TGXJZFJTnew · submitted 2019-07-07 · 💻 cs.CV

Pith reviewed 2026-05-25 01:30 UTC · model grok-4.3

classification 💻 cs.CV

keywords: feature detection · keypoint localization · pre-trained CNN · saliency map · gradient extraction · repeatability · matching score · HPatches

The pith

A CNN trained only for classification already embeds the location information needed to detect repeatable image keypoints.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that gradients of a feature map from a standard pre-trained CNN, taken with respect to the input image, produce saliency maps whose local maxima mark usable keypoints. This extraction requires no extra supervised training or fine-tuning. On the HPatches benchmark the resulting detector matches the repeatability and matching scores of detectors trained specifically for the task, and it shows comparable robustness to viewpoint and illumination changes on Webcam and photo-tourism images. The work therefore claims that location information is already present inside networks trained on ordinary tasks. A reader would care because this removes the need to collect new labeled data or run separate training whenever a feature detector is required.

Core claim

The central claim is that a CNN trained on a standard task such as classification already embeds feature location information, and that this information can be recovered by computing the gradient of a feature map with respect to the input image. The resulting saliency map has local maxima at relevant keypoint locations, yielding a detector whose repeatability and matching scores on HPatches, Webcam, and photo-tourism data are comparable to those of networks trained explicitly for keypoint detection.

What carries the argument

The gradient of the feature map with respect to the input image, which forms a saliency map whose local maxima serve as keypoint locations.
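
As a rough illustration of this mechanism, the sketch below computes a saliency map for a toy one-layer "network" (a few convolution kernels followed by ReLU) in pure Python. Everything in it is an assumption made for illustration: the helper names `conv2d_valid` and `saliency_from_feature_gradient`, the hand-picked kernels, and in particular the way the Jacobian is collapsed into a single map by back-propagating the feature values themselves (equivalently, taking the gradient of ½‖F‖² with respect to the image). The summary above does not specify that reduction, and a real pre-trained CNN would rely on a framework's automatic differentiation instead.

```python
def conv2d_valid(image, kernel):
    """Cross-correlate a 2-D image with a 2-D kernel ('valid' mode, no padding)."""
    kh, kw = len(kernel), len(kernel[0])
    h, w = len(image), len(image[0])
    return [
        [
            sum(image[y + i][x + j] * kernel[i][j]
                for i in range(kh) for j in range(kw))
            for x in range(w - kw + 1)
        ]
        for y in range(h - kh + 1)
    ]


def saliency_from_feature_gradient(image, kernels):
    """Toy saliency: |d(0.5 * ||F||^2) / d image| with F = ReLU(conv(image, k)).

    Assumption of this sketch: the feature values are used as the upstream
    gradient so that the Jacobian reduces to one map of the image's size.
    """
    h, w = len(image), len(image[0])
    grad = [[0.0] * w for _ in range(h)]
    for kernel in kernels:
        kh, kw = len(kernel), len(kernel[0])
        pre_activation = conv2d_valid(image, kernel)
        for y, row in enumerate(pre_activation):
            for x, value in enumerate(row):
                if value <= 0:
                    continue  # ReLU is inactive here, so no gradient flows back
                # Chain rule: each pixel in the receptive field receives
                # (feature value) * (kernel weight at that offset).
                for i in range(kh):
                    for j in range(kw):
                        grad[y + i][x + j] += value * kernel[i][j]
    return [[abs(g) for g in row] for row in grad]


if __name__ == "__main__":
    # A single bright pixel in the middle of a 5x5 image.
    demo = [[0.0] * 5 for _ in range(5)]
    demo[2][2] = 1.0
    demo_kernels = [
        [[0.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 0.0]],  # identity-like
        [[-1.0, 0.0, 1.0]],                                    # horizontal edge
    ]
    for line in saliency_from_feature_gradient(demo, demo_kernels):
        print(line)
```

The nested loops make the chain rule explicit at the cost of speed; the point is only to show that the result is a map with the same size as the input, whose peaks can then be searched for keypoints.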

If this is right

A single network trained for classification can supply both descriptors and detector locations.
Feature detection becomes possible on any pre-trained CNN without collecting new keypoint labels.
The same gradient procedure can be applied across different backbone architectures and training tasks.
Performance remains stable under the illumination and viewpoint changes tested in the Webcam and photo-tourism sets.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The result suggests that spatial structure is learned as a side-effect of classification training and could be harvested for other localization tasks.
One could test whether the same gradient extraction works on networks trained for segmentation or detection rather than pure classification.
If the method generalizes, it would reduce the cost of deploying feature detectors in new domains where only classification data are available.

Load-bearing premise

The local maxima of the gradient-derived saliency map coincide with relevant and repeatable keypoint positions.
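
To make the premise concrete, here is a minimal sketch of how local maxima might be pulled out of a saliency map: discard weak responses, then keep only pixels that are the unique maximum of a square neighbourhood (a simple form of non-maximum suppression). The function name `detect_keypoints`, the fixed threshold, and the square window are assumptions of this sketch; the page does not give the paper's exact thresholding or suppression procedure.

```python
def detect_keypoints(saliency, threshold=0.0, nms_radius=1):
    """Return (row, col) positions of strict local maxima above `threshold`.

    A pixel is kept when it is the unique maximum of the square window of
    half-width `nms_radius` centred on it. Results are sorted from the
    strongest to the weakest response.
    """
    height, width = len(saliency), len(saliency[0])
    scored = []
    for y in range(height):
        for x in range(width):
            value = saliency[y][x]
            if value <= threshold:
                continue  # too weak to be a keypoint candidate
            window = [
                saliency[j][i]
                for j in range(max(0, y - nms_radius), min(height, y + nms_radius + 1))
                for i in range(max(0, x - nms_radius), min(width, x + nms_radius + 1))
            ]
            # Reject plateaus: the maximum must occur exactly once in the window.
            if value == max(window) and window.count(value) == 1:
                scored.append((value, y, x))
    scored.sort(reverse=True)
    return [(y, x) for _, y, x in scored]


if __name__ == "__main__":
    demo = [[0.0] * 5 for _ in range(5)]
    demo[1][1], demo[1][2], demo[3][4] = 0.9, 0.5, 0.8
    print(detect_keypoints(demo, threshold=0.1, nms_radius=1))
```

Whether the positions returned this way are repeatable across viewpoint and illumination changes is exactly what the premise asserts and what the benchmarks are meant to test.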

What would settle it

The claim would be undermined if, on the HPatches dataset, the repeatability score of the gradient-based detector fell substantially below the scores reported for detectors trained specifically for keypoint detection.
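
For readers unfamiliar with the two scores, the following sketch shows one common way to compute them for an image pair related by a known homography: keypoints from the first image are warped into the second, and pairs that are mutual nearest neighbours within a pixel threshold count as repeated; the matching score additionally requires the pair to be mutual nearest neighbours in descriptor space. The function names, the use of Euclidean distance, and the normalisation by the smaller keypoint count are assumptions of this sketch and may differ from the paper's exact definitions.

```python
import math


def warp_point(homography, point):
    """Apply a 3x3 homography (nested lists) to an (x, y) point."""
    x, y = point
    h = homography
    xh = h[0][0] * x + h[0][1] * y + h[0][2]
    yh = h[1][0] * x + h[1][1] * y + h[1][2]
    wh = h[2][0] * x + h[2][1] * y + h[2][2]
    return (xh / wh, yh / wh)


def mutual_nearest_pairs(items_a, items_b, max_distance):
    """Index pairs (i, j) that are mutual nearest neighbours within max_distance."""
    if not items_a or not items_b:
        return set()
    nearest_b = [min(range(len(items_b)), key=lambda j: math.dist(a, items_b[j]))
                 for a in items_a]
    nearest_a = [min(range(len(items_a)), key=lambda i: math.dist(items_a[i], b))
                 for b in items_b]
    return {
        (i, j)
        for i, j in enumerate(nearest_b)
        if nearest_a[j] == i and math.dist(items_a[i], items_b[j]) <= max_distance
    }


def repeatability(kp1, kp2, homography, eps_kp):
    """Fraction of keypoints re-detected after warping image 1 into image 2."""
    warped = [warp_point(homography, p) for p in kp1]
    pairs = mutual_nearest_pairs(warped, kp2, eps_kp)
    denominator = min(len(kp1), len(kp2))  # normalisation is an assumption
    return len(pairs) / denominator if denominator else 0.0


def matching_score(kp1, desc1, kp2, desc2, homography, eps_kp, eps_d):
    """Fraction of pairs that agree in both image space and descriptor space."""
    warped = [warp_point(homography, p) for p in kp1]
    spatial = mutual_nearest_pairs(warped, kp2, eps_kp)
    descriptor = mutual_nearest_pairs(desc1, desc2, eps_d)
    denominator = min(len(kp1), len(kp2))  # normalisation is an assumption
    return len(spatial & descriptor) / denominator if denominator else 0.0
```

A substantially lower `repeatability` value for the gradient-based detector than for trained detectors, measured on the same image pairs and thresholds, is the kind of outcome that would count against the claim.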

Figures

Figures reproduced from arXiv: 1907.03261 by Assia Benbihi, Cédric Pradalier, Matthieu Geist.

**Figure 2.** Saliency maps thresholding to keep only the …

**Figure 3.** (Bigger version Figure 15.) Saliency maps com…

**Figure 4.** Left-Right: HPatches: planar viewpoint. Web…

**Figure 5.** Top-Down: HPatches-Webcam. Left-Right: re…

**Figure 6.** HPatches scale. Left-Right: rep, ms. Scale Robustness. ELF-VGG is compared with state-of-the-art detectors and their respective descriptors (Figure 6). Repeatability is mostly stable for all methods: SIFT and SuperPoint are the most invariant whereas ELF follows the same variations as LIFT and LF-Net. Once again, ms better assesses the detectors' performance: SuperPoint is the most robust to scale changes…

**Figure 7.** HPatches rotation. Left-Right: rep, ms. Rotation Robustness. Even though rep shows little variations (…

**Figure 11.** Feature gradient (right) provides a sparser sig…

**Figure 12.** Green lines show putative matches of the sim…

**Figure 10.** Gradient baseline. The saliency map used in ELF is replaced with simple Sobel or Laplacian gradient maps. The rest of the detection pipeline stays the same and we compute their performance (…

**Figure 14.** SfM from small subsets. Evolution of mAP of camera pose for increasing tolerance threshold. Structure-from-Motion from small subsets. Task 2 “proposes to build SfM reconstructions from small (3, 5, 10, 25) subsets of images and use the poses obtained from the entire (much larger) set as ground truth” [1]

**Figure 13.** Wide stereo matching. Left: matching score (%) of sparse methods (up to 512 keypoints) on phototourism. Right: Evolution of mAP of camera pose for increasing tolerance threshold (degrees)

**Figure 15.** Enlargement of Figure 3. Saliency maps computed from the feature map gradient
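
The gradient baseline mentioned in the Figure 10 caption swaps the CNN-derived saliency map for a plain image-gradient map while keeping the rest of the detection pipeline. A minimal pure-Python sketch of a Sobel gradient-magnitude map is given below; the function name `sobel_magnitude`, the zeroed one-pixel border, and the use of the L2 magnitude are assumptions of the sketch rather than details stated in the caption.

```python
import math

# Standard 3x3 Sobel kernels for horizontal and vertical derivatives.
SOBEL_X = [[-1, 0, 1], [-2, 0, 2], [-1, 0, 1]]
SOBEL_Y = [[-1, -2, -1], [0, 0, 0], [1, 2, 1]]


def sobel_magnitude(image):
    """Return the Sobel gradient magnitude of a 2-D grayscale image.

    The one-pixel border is left at zero (an assumption of this sketch).
    """
    height, width = len(image), len(image[0])
    out = [[0.0] * width for _ in range(height)]
    for y in range(1, height - 1):
        for x in range(1, width - 1):
            gx = sum(SOBEL_X[i][j] * image[y + i - 1][x + j - 1]
                     for i in range(3) for j in range(3))
            gy = sum(SOBEL_Y[i][j] * image[y + i - 1][x + j - 1]
                     for i in range(3) for j in range(3))
            out[y][x] = math.hypot(gx, gy)
    return out


if __name__ == "__main__":
    # Vertical step edge: dark on the left, bright on the right.
    step = [[0, 0, 1, 1, 1] for _ in range(5)]
    for line in sobel_magnitude(step):
        print(line)
```

Feeding such a map through the same keypoint extraction as the CNN saliency map gives a like-for-like comparison of the two signal sources.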

Original abstract

This paper introduces a novel feature detector based only on information embedded inside a CNN trained on standard tasks (e.g. classification). While previous works already show that the features of a trained CNN are suitable descriptors, we show here how to extract the feature locations from the network to build a detector. This information is computed from the gradient of the feature map with respect to the input image. This provides a saliency map with local maxima on relevant keypoint locations. Contrary to recent CNN-based detectors, this method requires neither supervised training nor finetuning. We evaluate how repeatable and how matchable the detected keypoints are with the repeatability and matching scores. Matchability is measured with a simple descriptor introduced for the sake of the evaluation. This novel detector reaches similar performances on the standard evaluation HPatches dataset, as well as comparable robustness against illumination and viewpoint changes on Webcam and photo-tourism images. These results show that a CNN trained on a standard task embeds feature location information that is as relevant as when the CNN is specifically trained for feature detection.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces ELF, a feature detector that extracts keypoint locations from the gradients of feature maps in a pre-trained CNN (trained on standard tasks like classification) to produce a saliency map whose local maxima serve as keypoints. No supervised training or finetuning is required. It reports repeatability and matching scores on HPatches comparable to specialized detectors, plus robustness to illumination and viewpoint changes on Webcam and photo-tourism images, concluding that location information is embedded in standard CNNs at a level relevant for feature detection.

Significance. If validated with rigorous controls, the result would indicate that gradient-based saliency from classification CNNs can yield repeatable keypoints without task-specific training, potentially allowing reuse of existing models for detection and reducing the need for dedicated feature-detection networks.

major comments (3)

[Abstract / method] The central claim that gradients of an intermediate feature map produce saliency maxima on 'relevant keypoint locations' (abstract) lacks supporting analysis of what the saliency responds to. Classification-trained gradients typically emphasize semantic or class-discriminative regions rather than viewpoint-invariant local structures such as corners; without layer-specific ablation or response characterization (e.g., edge vs. blob), the 'embedded localisation' interpretation is not load-bearing.
[Evaluation] Matchability is evaluated with a simple descriptor introduced solely for this paper. This choice prevents direct comparison to prior detectors that use standard or learned descriptors, undermining the claim of 'similar performances' on HPatches.
[Experiments] No control experiment is described that isolates the contribution of the gradient step (e.g., showing that the same CNN without gradient-based localization fails at repeatable detection). This is required to establish that the observed scores arise from embedded location information rather than incidental properties of the feature maps.

minor comments (2)

[Abstract] The abstract and evaluation sections should explicitly state the CNN architecture, layer index, and exact procedure (thresholding, non-maximum suppression) used to extract local maxima from the saliency map.
[Evaluation] Quantitative tables with error bars, number of images, and statistical significance tests are needed to support the 'comparable' and 'robust' claims; their absence makes the reported scores difficult to interpret.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses

Referee: [Abstract / method] The central claim that gradients of an intermediate feature map produce saliency maxima on 'relevant keypoint locations' (abstract) lacks supporting analysis of what the saliency responds to. Classification-trained gradients typically emphasize semantic or class-discriminative regions rather than viewpoint-invariant local structures such as corners; without layer-specific ablation or response characterization (e.g., edge vs. blob), the 'embedded localisation' interpretation is not load-bearing.

Authors: We agree that additional characterization of the saliency response would strengthen the interpretation of the results. In the revised manuscript we will add layer-specific ablations (showing repeatability across different convolutional layers) together with qualitative examples and quantitative measures of the types of structures (edges, corners, blobs) to which the gradient-based saliency responds on the evaluation datasets. revision: yes

Referee: [Evaluation] Matchability is evaluated with a simple descriptor introduced solely for this paper. This choice prevents direct comparison to prior detectors that use standard or learned descriptors, undermining the claim of 'similar performances' on HPatches.

Authors: The simple descriptor was introduced to isolate detector performance from descriptor quality. We acknowledge that this limits direct comparability with prior work. In the revision we will report additional matching scores on HPatches using a standard hand-crafted descriptor (SIFT) and, where feasible, a learned descriptor, allowing readers to compare ELF keypoints against published detector-descriptor combinations. revision: yes

Referee: [Experiments] No control experiment is described that isolates the contribution of the gradient step (e.g., showing that the same CNN without gradient-based localization fails at repeatable detection). This is required to establish that the observed scores arise from embedded location information rather than incidental properties of the feature maps.

Authors: We will add a control experiment that replaces the gradient-based saliency map with the raw feature-map activations (or their spatial maximum) as the keypoint source, keeping the same CNN and post-processing. This will quantify the specific contribution of the gradient computation to repeatability and matchability. revision: yes

Circularity Check

0 steps flagged

No circularity; empirical method and evaluation are self-contained

Full rationale

The paper presents an empirical detector that computes saliency from gradients of a pre-trained CNN feature map and evaluates repeatability/matching on HPatches, Webcam, and photo-tourism data. No equations, parameter fits, self-citations, or uniqueness theorems are invoked that would reduce the central claim (embedded location information is comparably relevant) to a definition or input quantity by construction. The derivation chain consists of a proposed extraction procedure plus external dataset benchmarks, with no load-bearing steps that collapse into the method's own choices.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on the domain assumption that gradients of internal feature maps highlight repeatable keypoints; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Gradient of the feature map w.r.t. the input image yields a saliency map with local maxima at relevant keypoints
Explicitly stated in the abstract as the mechanism that produces the detector.

Reference graph

Works this paper leans on

54 extracted references · 54 canonical work pages · 4 internal anchors

[1] Cvpr19 image matching challenge. https://image-matching-workshop.github.io/challenge/, 2019.
[2] Aanæs, H., Dahl, A. L., and Pedersen, K. S. Interesting interest points. International Journal of Computer Vision 97, 1 (2012), 18–35.
[3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. Tensorflow: a system for large-scale machine learning. In OSDI (2016), vol. 16, pp. 265–283.
[4] Alcantarilla, P. F., Bartoli, A., and Davison, A. J. Kaze features. In European Conference on Computer Vision (2012), Springer, pp. 214–227.
[5] Balntas, V., Lenc, K., Vedaldi, A., and Mikolajczyk, K. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Conference on Computer Vision and Pattern Recognition (CVPR) (2017), vol. 4, p. 6.
[6] Balntas, V., Riba, E., Ponsa, D., and Mikolajczyk, K. Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC (2016), vol. 1, p. 3.
[7] Bay, H., Tuytelaars, T., and Van Gool, L. Surf: Speeded up robust features. In European conference on computer vision (2006), Springer, pp. 404–417.
[8] Calonder, M., Lepetit, V., Strecha, C., and Fua, P. Brief: Binary robust independent elementary features. In European conference on computer vision (2010), Springer, pp. 778–792.
[9] Chollet, F. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 (2017), pp. 1800–1807.
[10] Choy, C. B., Gwak, J., Savarese, S., and Chandraker, M. Universal correspondence network. In Advances in Neural Information Processing Systems (2016), pp. 2414–2422.
[11] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. Imagenet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 248–255.
[12] DeTone, D., Malisiewicz, T., and Rabinovich, A. Superpoint: Self-supervised interest point detection and description. In CVPR Deep Learning for Visual SLAM Workshop (2018).
[13] Fischer, P., Dosovitskiy, A., and Brox, T. Descriptor matching with convolutional neural networks: a comparison to sift. arXiv preprint arXiv:1405.5769 (2014).
[14] Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2414–2423.
[15] Han, X., Leung, T., Jia, Y., Sukthankar, R., and Berg, A. C. Matchnet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3279–3286.
[16] Heinly, J., Schonberger, J. L., Dunn, E., and Frahm, J.-M. Reconstructing the world* in six days *(as captured by the yahoo 100 million image dataset). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3287–3295.
[17] Kapur, J. N., Sahoo, P. K., and Wong, A. K. A new method for gray-level picture thresholding using the entropy of the histogram. Computer vision, graphics, and image processing 29, 3 (1985), 273–285.
[18] Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems (2012), pp. 1097–1105.
[19] Lenc, K., Gulshan, V., and Vedaldi, A. VLBenchmarks. http://www.vlfeat.org/benchmarks/xsxs, 2011.
[20] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft coco: Common objects in context. In European conference on computer vision (2014), Springer, pp. 740–755.
[21] Lowe, D. G. Distinctive image features from scale-invariant keypoints. International journal of computer vision 60, 2 (2004), 91–110.
[22] Mahendran, A., and Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE conference on computer vision and pattern recognition (2015), pp. 5188–5196.
[23] Matas, J., Chum, O., Urban, M., and Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image and vision computing 22, 10 (2004), 761–767.
[24] Melekhov, I., Kannala, J., and Rahtu, E. Siamese network features for image matching. In 2016 23rd International Conference on Pattern Recognition (ICPR) (2016), IEEE, pp. 378–383.
[25] Mikolajczyk, K., and Schmid, C. A performance evaluation of local descriptors. IEEE transactions on pattern analysis and machine intelligence 27, 10 (2005), 1615–1630.
[26] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., and Van Gool, L. A comparison of affine region detectors. International journal of computer vision 65, 1-2 (2005), 43–72.
[27] Noh, H., Araujo, A., Sim, J., Weyand, T., and Han, B. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 3456–3465.
[28] Ono, Y., Trulls, E., Fua, P., and Yi, K. M. Lf-net: Learning local features from images. In Advances in Neural Information Processing Systems (2018).
[29] Rosten, E., and Drummond, T. Machine learning for high-speed corner detection. In European conference on computer vision (2006), Springer, pp. 430–443.
[30] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. Orb: An efficient alternative to sift or surf. In Computer Vision (ICCV), 2011 IEEE international conference on (2011), IEEE, pp. 2564–2571.
[31] Savinov, N., Seki, A., Ladicky, L., Sattler, T., and Pollefeys, M. Quad-networks: unsupervised learning to rank for interest point detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
[32] Schonberger, J. L., and Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4104–4113.
[33] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., et al. Grad-cam: Visual explanations from deep networks via gradient-based localization. In ICCV (2017), pp. 618–626.
[34] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., and Moreno-Noguer, F. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 118–126.
[35] Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
[36] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
[37] Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017).
[38] Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR (workshop track) (2015).
[39] Strecha, C., Von Hansen, W., Van Gool, L., Fua, P., and Thoennessen, U. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (2008), IEEE, pp. 1–8.
[40] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning (2017), pp. 3319–3328.
[41] Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., and Torii, A. Inloc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 7199–7209.
[42] Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. Yfcc100m: The new data in multimedia research. Communications of the ACM 59, 2, 64–73.
[43] Verdie, Y., Yi, K., Fua, P., and Lepetit, V. Tilde: A temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5279–5288.
[44] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. Lift: Learned invariant feature transform. In European Conference on Computer Vision (2016), Springer, pp. 467–483.
[45] Zagoruyko, S., and Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE conference on computer vision and pattern recognition (2015), pp. 4353–4361.
[46] Zeiler, M. D., and Fergus, R. Visualizing and understanding convolutional networks. In European conference on computer vision (2014), Springer, pp. 818–833.
A Metrics definition. We make explicit the repeatability and matching score definitions introduced in [26], and our adaptations, using the following notation: let (I1, I2) be a pair of images and ...

[47] and LF-Net [28]. Matching score. The matching score definition introduced in [26] captures the percentage of keypoint pairs that are nearest neighbours both in image space and in descriptor space, and for which these two distances are below their respective thresholds ϵ_kp and ϵ_d. Let M be defined as in the previous paragraph and M_d be the analog of M when the ...

[48] Robustness to light (Webcam [43])
[43] [5] [43]
ELF-VGG 63.81 53.23 51.84 43.73
ELF-AlexNet 51.30 38.54 35.21 31.92
ELF-Xception 48.06 49.84 29.81 35.48
ELF-SuperPoint 59.7 46.29 44.32 18.11
ELF-LFNet 60.1 41.90 44.56 33.43
LF-Net 61.16 48.27 34.19 18.10
SuperPoint 68.57 46.35 57.11 32.44
LIFT 54.66 42.21 34.02 17.83
SURF 54.51 33.93 26.10 10.13
SIFT 51.19 28.25 24.58 8.30
ORB 53.44 31.56...

[49]
34.19 57.11 34.02 24.58 26.10 14.76
44.19 53.71 39.48 27.03 34.97 20.04

[50]
18.10 32.44 17.83 10.13 8.30 1.28
30.71 34.60 26.84 13.21 21.43 13.91
Table 2: Individual component performance (Fig. 9-stripes). Matching score for the integration of the VGG pool3 simple-descriptor with others' detection. Top: Original description. Bottom: Integration of simple-descriptor. HPatches: [5]. Webcam: [43]
LF-Net SuperPoint LIFT SIFT SURF ORB

[51]
34.19 57.11 34.02 24.58 26.10 14.76
39.16 54.44 42.48 50.63 30.91 36.96

[52]
18.10 32.44 17.83 10.13 8.30 1.28
26.70 39.55 30.82 36.83 19.14 6.60
Table 3: Individual component performance (Fig. 9-circle). Matching score for the integration of ELF-VGG (on pool2) with others' descriptor. Top: Original detection. Bottom: Integration of ELF. HPatches: [5]. Webcam: [43]
Repeatability Matching Score

[53]
[43] [5] [43]
Sobel-VGG 56.99 33.74 42.11 20.99
Lapl.-VGG 65.45 33.74 55.25 22.79
VGG 63.81 53.23 51.84 43.73
Sobel-AlexNet 56.44 33.74 30.57 15.42
Lapl.-AlexNet 65.93 33.74 40.92 15.42
AlexNet 51.30 38.54 35.21 31.92
Sobel-Xception 56.44 33.74 34.14 16.86
Lapl.-Xception 65.93 33.74 42.52 16.86
Xception 48.06 49.84 29.81 35.48
Table 4: Gradient baseline o...

[54] (Fig. 10). C ELF Meta Parameters. This section specifies the meta parameters values for the ELF variants. For all methods, (w_NMS, b_NMS) = (10, 10).
• Denoise: (µ_noise, σ_noise).
• Threshold: (µ_thr, σ_thr).
• F^l: the feature map whose gradient is used for detection.
• simple-des: the feature map used for simple-description.
Unless mentioned otherwise,...

[1] [1]

https: //image-matching-workshop.github.io/ challenge/, 2019

Cvpr19 image matching challenge. https: //image-matching-workshop.github.io/ challenge/, 2019

work page 2019

[2] [2]

L., AND PEDERSEN , K

A ANÆS , H., D AHL , A. L., AND PEDERSEN , K. S. Inter- esting interest points. International Journal of Computer Vision 97, 1 (2012), 18–35

work page 2012

[3] [3]

Tensorﬂow: a system for large-scale machine learning

A BADI , M., B ARHAM , P., C HEN , J., C HEN , Z., D AVIS, A., D EAN , J., D EVIN , M., G HEMAWAT, S., I RVING , G., ISARD , M., ET AL . Tensorﬂow: a system for large-scale machine learning. In OSDI (2016), vol. 16, pp. 265–283

work page 2016

[4] [4]

F., B ARTOLI , A., AND DAVISON , A

A LCANTARILLA , P. F., B ARTOLI , A., AND DAVISON , A. J. Kaze features. In European Conference on Com- puter Vision (2012), Springer, pp. 214–227

work page 2012

[5] [5]

Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors

B ALNTAS , V., L ENC , K., V EDALDI , A., AND MIKO- LAJCZYK , K. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Confer- ence on Computer Vision and Pattern Recognition (CVPR) (2017), vol. 4, p. 6

work page 2017

[6] [6]

Learning local feature descriptors with triplets and shallowconvolutional neural networks

B ALNTAS , V., R IBA, E., P ONSA , D., AND MIKOLA - JCZYK , K. Learning local feature descriptors with triplets and shallowconvolutional neural networks. In BMVC (2016), vol. 1, p. 3

work page 2016

[7] [7]

Surf: Speeded up robust features

B AY, H., T UYTELAARS , T., AND VAN GOOL , L. Surf: Speeded up robust features. In European conference on computer vision (2006), Springer, pp. 404–417

work page 2006

[8] [8]

Brief: Binary robust independent elementary fea- tures

C ALONDER , M., L EPETIT , V., S TRECHA , C., AND FUA, P. Brief: Binary robust independent elementary fea- tures. In European conference on computer vision (2010), Springer, pp. 778–792

work page 2010

[9] [9]

Xception: Deep learning with depthwise separable convolutions

C HOLLET , F. Xception: Deep learning with depthwise separable convolutions. In2017 IEEE Conference on Com- puter Vision and Pattern Recognition, CVPR 2017, Hon- olulu, HI, USA, July 21-26, 2017 (2017), pp. 1800–1807

work page 2017

[10] [10]

B., G WAK, J., S AVARESE , S., AND CHAN - DRAKER , M

C HOY, C. B., G WAK, J., S AVARESE , S., AND CHAN - DRAKER , M. Universal correspondence network. In Ad- vances in Neural Information Processing Systems (2016), pp. 2414–2422

work page 2016

[11] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 248–255.

[12] DeTone, D., Malisiewicz, T., and Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In CVPR Deep Learning for Visual SLAM Workshop (2018).

[13] Fischer, P., Dosovitskiy, A., and Brox, T. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769 (2014).

[14] Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2414–2423.

[15] Han, X., Leung, T., Jia, Y., Sukthankar, R., and Berg, A. C. MatchNet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3279–3286.

[16] Heinly, J., Schonberger, J. L., Dunn, E., and Frahm, J.-M. Reconstructing the world* in six days *(as captured by the Yahoo 100 million image dataset). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3287–3295.

[17] Kapur, J. N., Sahoo, P. K., and Wong, A. K. A new method for gray-level picture thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image Processing 29, 3 (1985), 273–285.

[18] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (2012), pp. 1097–1105.

[19] Lenc, K., Gulshan, V., and Vedaldi, A. VLBenchmarks. http://www.vlfeat.org/benchmarks/xsxs, 2011.

[20] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (2014), Springer, pp. 740–755.

[21] Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.

[22] Mahendran, A., and Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5188–5196.

[23] Matas, J., Chum, O., Urban, M., and Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 10 (2004), 761–767.

[24] Melekhov, I., Kannala, J., and Rahtu, E. Siamese network features for image matching. In 2016 23rd International Conference on Pattern Recognition (ICPR) (2016), IEEE, pp. 378–383.

[25] Mikolajczyk, K., and Schmid, C. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 10 (2005), 1615–1630.

[26] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., and Van Gool, L. A comparison of affine region detectors. International Journal of Computer Vision 65, 1-2 (2005), 43–72.

[27] Noh, H., Araujo, A., Sim, J., Weyand, T., and Han, B. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 3456–3465.

[28] Ono, Y., Trulls, E., Fua, P., and Yi, K. M. LF-Net: Learning local features from images. In Advances in Neural Information Processing Systems (2018).

[29] Rosten, E., and Drummond, T. Machine learning for high-speed corner detection. In European Conference on Computer Vision (2006), Springer, pp. 430–443.

[30] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on (2011), IEEE, pp. 2564–2571.

[31] Savinov, N., Seki, A., Ladicky, L., Sattler, T., and Pollefeys, M. Quad-networks: unsupervised learning to rank for interest point detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).

[32] Schonberger, J. L., and Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4104–4113.

[33] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV (2017), pp. 618–626.

[34] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., and Moreno-Noguer, F. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 118–126.

[35] Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).

[36] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).

[37] Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017).

[38] Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR (workshop track) (2015).

[39] Strecha, C., Von Hansen, W., Van Gool, L., Fua, P., and Thoennessen, U. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (2008), IEEE, pp. 1–8.

[40] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning (2017), pp. 3319–3328.

[41] Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., and Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 7199–7209.

[42] Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. YFCC100M: The new data in multimedia research. Communications of the ACM 59, 2, 64–73.

[43] Verdie, Y., Yi, K., Fua, P., and Lepetit, V. TILDE: A temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5279–5288.

[44] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. LIFT: Learned invariant feature transform. In European Conference on Computer Vision (2016), Springer, pp. 467–483.

[45] Zagoruyko, S., and Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 4353–4361.

[46] Zeiler, M. D., and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (2014), Springer, pp. 818–833.

A Metrics definition

We make explicit the repeatability and matching score definitions introduced in [26], together with our adaptations, using the following notation: let $(I_1, I_2)$ be a pair of images and ...

and LF-Net [28].

Matching score. The matching score definition introduced in [26] captures the percentage of keypoint pairs that are nearest neighbours both in image space and in descriptor space, and for which these two distances are below their respective thresholds $\epsilon_{kp}$ and $\epsilon_d$. Let $M$ be defined as in the previous paragraph and $M_d$ be the analog of $M$ when the ...
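The matching-score description above can be illustrated with a minimal sketch in plain Python. It is not the authors' implementation: the function names, the assumption that keypoints of the first image are already warped into the second image, the Euclidean distance for both spaces, and the normalisation by the smaller keypoint count are all assumptions made here, since the full definition is cut off in the text. The function returns a fraction; multiply by 100 for a percentage.

```python
import math


def mutual_nearest_neighbours(xs, ys):
    """Return index pairs (i, j) where ys[j] is nearest to xs[i] and vice versa."""
    forward = [min(range(len(ys)), key=lambda j: math.dist(x, ys[j])) for x in xs]
    backward = [min(range(len(xs)), key=lambda i: math.dist(xs[i], y)) for y in ys]
    return {(i, j) for i, j in enumerate(forward) if backward[j] == i}


def matching_score(kp1_in_2, kp2, desc1, desc2, eps_kp, eps_d):
    """Fraction of pairs that are mutual nearest neighbours in both spaces.

    kp1_in_2: keypoints of image 1, assumed already projected into image 2.
    The denominator (smaller keypoint count) is an assumption of this sketch.
    """
    if not kp1_in_2 or not kp2:
        return 0.0
    # Mutual nearest neighbours in image space, closer than eps_kp.
    spatial = {
        (i, j)
        for i, j in mutual_nearest_neighbours(kp1_in_2, kp2)
        if math.dist(kp1_in_2[i], kp2[j]) < eps_kp
    }
    # Mutual nearest neighbours in descriptor space, closer than eps_d.
    descriptor = {
        (i, j)
        for i, j in mutual_nearest_neighbours(desc1, desc2)
        if math.dist(desc1[i], desc2[j]) < eps_d
    }
    # A pair counts only if it satisfies both conditions.
    return len(spatial & descriptor) / min(len(kp1_in_2), len(kp2))


# Toy data: only the first pair agrees in both image and descriptor space.
kp_a = [(0.0, 0.0), (10.0, 10.0)]
kp_b = [(0.5, 0.0), (30.0, 30.0)]
desc_a = [(1.0, 0.0), (0.0, 1.0)]
desc_b = [(1.0, 0.1), (0.0, 1.0)]
score = matching_score(kp_a, kp_b, desc_a, desc_b, eps_kp=5.0, eps_d=0.5)
```

In the toy data the second keypoint pair matches in descriptor space but is far apart in the image, so it is rejected and `score` is one match out of two keypoints.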

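Robustness to light (Webcam [43])

| Method | Repeatability [5] | Repeatability [43] | Matching score [5] | Matching score [43] |
|---|---|---|---|---|
| ELF-VGG | 63.81 | 53.23 | 51.84 | 43.73 |
| ELF-AlexNet | 51.30 | 38.54 | 35.21 | 31.92 |
| ELF-Xception | 48.06 | 49.84 | 29.81 | 35.48 |
| ELF-SuperPoint | 59.7 | 46.29 | 44.32 | 18.11 |
| ELF-LFNet | 60.1 | 41.90 | 44.56 | 33.43 |
| LF-Net | 61.16 | 48.27 | 34.19 | 18.10 |
| SuperPoint | 68.57 | 46.35 | 57.11 | 32.44 |
| LIFT | 54.66 | 42.21 | 34.02 | 17.83 |
| SURF | 54.51 | 33.93 | 26.10 | 10.13 |
| SIFT | 51.19 | 28.25 | 24.58 | 8.30 |
| ORB | 53.44 | 31.56 | ... | ... |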

| Dataset | Description | LF-Net | SuperPoint | LIFT | SIFT | SURF | ORB |
|---|---|---|---|---|---|---|---|
| HPatches [5] | Original | 34.19 | 57.11 | 34.02 | 24.58 | 26.10 | 14.76 |
| HPatches [5] | Simple-descriptor | 44.19 | 53.71 | 39.48 | 27.03 | 34.97 | 20.04 |
| Webcam [43] | Original | 18.10 | 32.44 | 17.83 | 10.13 | 8.30 | 1.28 |
| Webcam [43] | Simple-descriptor | 30.71 | 34.60 | 26.84 | 13.21 | 21.43 | 13.91 |

Table 2: Individual component performance (Fig. 9-stripes). Matching score for the integration of the VGG pool3 simple-descriptor with other methods' detection. Top: original description. Bottom: integration of simple-descriptor. HPatches: [5]. Webcam: [43].

| Dataset | Detection | LF-Net | SuperPoint | LIFT | SIFT | SURF | ORB |
|---|---|---|---|---|---|---|---|
| HPatches [5] | Original | 34.19 | 57.11 | 34.02 | 24.58 | 26.10 | 14.76 |
| HPatches [5] | ELF | 39.16 | 54.44 | 42.48 | 50.63 | 30.91 | 36.96 |
| Webcam [43] | Original | 18.10 | 32.44 | 17.83 | 10.13 | 8.30 | 1.28 |
| Webcam [43] | ELF | 26.70 | 39.55 | 30.82 | 36.83 | 19.14 | 6.60 |

Table 3: Individual component performance (Fig. 9-circle). Matching score for the integration of ELF-VGG (on pool2) with other methods' descriptors. Top: original detection. Bottom: integration of ELF. HPatches: [5]. Webcam: [43].

| Method | Repeatability [5] | Repeatability [43] | Matching score [5] | Matching score [43] |
|---|---|---|---|---|
| Sobel-VGG | 56.99 | 33.74 | 42.11 | 20.99 |
| Lapl.-VGG | 65.45 | 33.74 | 55.25 | 22.79 |
| VGG | 63.81 | 53.23 | 51.84 | 43.73 |
| Sobel-AlexNet | 56.44 | 33.74 | 30.57 | 15.42 |
| Lapl.-AlexNet | 65.93 | 33.74 | 40.92 | 15.42 |
| AlexNet | 51.30 | 38.54 | 35.21 | 31.92 |
| Sobel-Xception | 56.44 | 33.74 | 34.14 | 16.86 |
| Lapl.-Xception | 65.93 | 33.74 | 42.52 | 16.86 |
| Xception | 48.06 | 49.84 | 29.81 | 35.48 |

Table 4: Gradient baseline o...

(Fig. 10).

C ELF Meta Parameters

This section specifies the meta-parameter values for the ELF variants. For all methods, $(w_{NMS}, b_{NMS}) = (10, 10)$.

• Denoise: $(\mu_{noise}, \sigma_{noise})$.
• Threshold: $(\mu_{thr}, \sigma_{thr})$.
• $F^l$: the feature map whose gradient is used for detection.
• simple-des: the feature map used for simple-description.

Unless mentioned otherwise,...
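To make the role of the NMS meta-parameters more concrete, here is a minimal sketch of non-maximum suppression on a saliency map, written in plain Python. It is only an illustration under stated assumptions: the text gives the values (10, 10) but not the semantics, so reading the first value as the side of the suppression window and the second as a border margin in pixels is an assumption, and the function name `nms_keypoints` and its `thr` parameter are hypothetical.

```python
def nms_keypoints(saliency, w_nms=10, b_nms=10, thr=0.0):
    """Return (x, y) positions that are local maxima of a 2D saliency map.

    Assumptions of this sketch: w_nms is the side of the suppression
    window, b_nms is a margin near the image border where no keypoint is
    kept, and thr is a plain saliency threshold.
    """
    height, width = len(saliency), len(saliency[0])
    radius = w_nms // 2
    keypoints = []
    for y in range(b_nms, height - b_nms):
        for x in range(b_nms, width - b_nms):
            value = saliency[y][x]
            if value <= thr:
                continue  # not salient enough to be a candidate
            # Largest saliency inside the window centred on (x, y).
            window_max = max(
                saliency[yy][xx]
                for yy in range(max(0, y - radius), min(height, y + radius + 1))
                for xx in range(max(0, x - radius), min(width, x + radius + 1))
            )
            if value == window_max:
                keypoints.append((x, y))
    return keypoints


# Toy 30x30 map: one strong peak, one weaker neighbour, one peak in the border.
toy_map = [[0.0] * 30 for _ in range(30)]
toy_map[12][15] = 1.0   # kept: maximum of its window
toy_map[12][17] = 0.5   # suppressed: the stronger peak lies in its window
toy_map[2][2] = 5.0     # ignored: inside the border margin
detected = nms_keypoints(toy_map)
```

With ties on a flat plateau this sketch keeps every tied pixel; a production detector would usually break ties or operate on a denoised map first.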