ELF: Embedded Localisation of Features in pre-trained CNN
Pith reviewed 2026-05-25 01:30 UTC · model grok-4.3
The pith
A CNN trained only for classification already embeds the location information needed to detect repeatable image keypoints.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that feature location information is embedded inside a CNN trained on standard tasks such as classification, and that it can be recovered by computing the gradient of the feature map with respect to the input image. The resulting saliency map has local maxima at relevant keypoint locations, yielding a detector whose repeatability and matching performance on HPatches, Webcam, and photo-tourism data is comparable to that of networks trained explicitly for keypoint detection.
What carries the argument
The gradient of the feature map with respect to the input image, which forms a saliency map whose local maxima serve as keypoint locations.
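As a rough illustration of this mechanism, the sketch below builds a saliency map from a toy one-layer "network" in plain NumPy. Everything in it is an assumption made for illustration: the random filters stand in for a pre-trained CNN, and the feature map is reduced to a scalar (half its squared norm) so that a gradient with respect to the image is well defined. The paper's actual network, layer, and reduction may differ.

```python
import numpy as np

def conv_valid(image, kernels):
    """Cross-correlate a 2-D image with K square kernels ('valid' padding)."""
    n_k, k, _ = kernels.shape
    h, w = image.shape
    out = np.zeros((n_k, h - k + 1, w - k + 1))
    for di in range(k):
        for dj in range(k):
            window = image[di:di + h - k + 1, dj:dj + w - k + 1]
            out += kernels[:, di, dj][:, None, None] * window[None, :, :]
    return out

def feature_map(image, kernels):
    """Toy stand-in for a CNN layer: convolution followed by ReLU."""
    return np.maximum(conv_valid(image, kernels), 0.0)

def gradient_saliency(image, kernels):
    """Absolute gradient of 0.5 * ||F(image)||^2 with respect to the image."""
    n_k, k, _ = kernels.shape
    h, w = image.shape
    feats = feature_map(image, kernels)
    # Backprop through the ReLU: the upstream signal is F itself, because
    # d(0.5 * F^2)/dF = F and F is already zero wherever the ReLU is inactive.
    grad = np.zeros_like(image, dtype=float)
    for di in range(k):
        for dj in range(k):
            grad[di:di + h - k + 1, dj:dj + w - k + 1] += np.tensordot(
                kernels[:, di, dj], feats, axes=(0, 0))
    return np.abs(grad)

rng = np.random.default_rng(0)
image = rng.random((16, 16))              # synthetic image, not real data
kernels = rng.standard_normal((4, 3, 3))  # random filters, not trained weights
saliency = gradient_saliency(image, kernels)
```

With a real pre-trained network the manual backward pass would normally be replaced by an automatic-differentiation framework; the hand-written version is used here only to keep the example dependency-free.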
If this is right
- A single network trained for classification can supply both descriptors and detector locations.
- Feature detection becomes possible on any pre-trained CNN without collecting new keypoint labels.
- The same gradient procedure can be applied across different backbone architectures and training tasks.
- Robustness to the illumination and viewpoint changes tested in the Webcam and photo-tourism sets remains comparable to that of trained detectors.
Where Pith is reading between the lines
- The result suggests that spatial structure is learned as a side-effect of classification training and could be harvested for other localization tasks.
- One could test whether the same gradient extraction works on networks trained for segmentation or detection rather than pure classification.
- If the method generalizes, it would reduce the cost of deploying feature detectors in new domains where only classification data are available.
Load-bearing premise
The local maxima of the gradient-derived saliency map coincide with relevant and repeatable keypoint positions.
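To make the premise concrete, the following sketch extracts local maxima from a saliency map with a simple threshold and non-maximum suppression. The function name, the window size, and the threshold are hypothetical choices for illustration; the page does not state the paper's exact extraction procedure.

```python
import numpy as np

def local_maxima(saliency, window=3, threshold=0.0):
    """Return (row, col) positions that exceed `threshold` and are the
    maximum of their `window` x `window` neighbourhood (window must be odd)."""
    if window % 2 == 0:
        raise ValueError("window must be odd")
    h, w = saliency.shape
    r = window // 2
    # Pad with -inf so border pixels are compared only with real neighbours.
    padded = np.pad(saliency, r, mode="constant", constant_values=-np.inf)
    neighbourhood_max = np.full((h, w), -np.inf)
    for di in range(window):
        for dj in range(window):
            neighbourhood_max = np.maximum(
                neighbourhood_max, padded[di:di + h, dj:dj + w])
    mask = (saliency >= neighbourhood_max) & (saliency > threshold)
    return np.argwhere(mask)
```

Ties inside a window are all kept by this version; a stricter variant would break ties, for example by keeping only the first position.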
What would settle it
The claim would be undermined if, on the HPatches dataset, the repeatability score of the gradient-based detector fell substantially below the scores reported for detectors trained specifically for keypoint detection.
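For readers unfamiliar with the metric, here is a minimal sketch of a repeatability score between two images related by a known homography. It is a simplified variant under stated assumptions: keypoints are (x, y) pairs, correspondences are mutual nearest neighbours within a pixel threshold `eps_kp`, the count is normalised by the smaller keypoint set, and the restriction to the region visible in both images is omitted. The paper's own definition follows [26] with adaptations and may differ in these details.

```python
import numpy as np

def warp_points(points_xy, homography):
    """Apply a 3x3 homography to an (N, 2) array of (x, y) points."""
    pts = np.hstack([points_xy, np.ones((len(points_xy), 1))])
    warped = pts @ homography.T
    return warped[:, :2] / warped[:, 2:3]

def repeatability(kp1_xy, kp2_xy, homography, eps_kp=5.0):
    """Fraction of keypoints re-detected in the second image."""
    kp1 = np.asarray(kp1_xy, dtype=float)
    kp2 = np.asarray(kp2_xy, dtype=float)
    if len(kp1) == 0 or len(kp2) == 0:
        return 0.0
    warped = warp_points(kp1, homography)
    dist = np.linalg.norm(warped[:, None, :] - kp2[None, :, :], axis=2)
    nn12 = dist.argmin(axis=1)   # nearest image-2 keypoint for each warped point
    nn21 = dist.argmin(axis=0)   # nearest warped point for each image-2 keypoint
    idx = np.arange(len(warped))
    mutual = nn21[nn12] == idx
    close = dist[idx, nn12] <= eps_kp
    return float(np.sum(mutual & close)) / min(len(kp1), len(kp2))
```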
Original abstract
This paper introduces a novel feature detector based only on information embedded inside a CNN trained on standard tasks (e.g. classification). While previous works already show that the features of a trained CNN are suitable descriptors, we show here how to extract the feature locations from the network to build a detector. This information is computed from the gradient of the feature map with respect to the input image. This provides a saliency map with local maxima on relevant keypoint locations. Contrary to recent CNN-based detectors, this method requires neither supervised training nor finetuning. We evaluate how repeatable and how matchable the detected keypoints are with the repeatability and matching scores. Matchability is measured with a simple descriptor introduced for the sake of the evaluation. This novel detector reaches similar performances on the standard evaluation HPatches dataset, as well as comparable robustness against illumination and viewpoint changes on Webcam and photo-tourism images. These results show that a CNN trained on a standard task embeds feature location information that is as relevant as when the CNN is specifically trained for feature detection.
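The matching score named in the abstract is described in an appendix fragment further down this page as the percentage of keypoint pairs that are nearest neighbours both in image space and in descriptor space, with both distances under their thresholds ϵ_kp and ϵ_d. The sketch below follows that description; the use of mutual nearest neighbours, Euclidean distances, and normalisation by the smaller keypoint set are assumptions, and the function names are hypothetical.

```python
import numpy as np

def mutual_nn_pairs(dist, max_dist):
    """Index pairs (i, j) that are mutual nearest neighbours within max_dist."""
    nn12 = dist.argmin(axis=1)
    nn21 = dist.argmin(axis=0)
    pairs = set()
    for i, j in enumerate(nn12):
        if nn21[j] == i and dist[i, j] <= max_dist:
            pairs.add((i, int(j)))
    return pairs

def matching_score(warped_kp1, kp2, desc1, desc2, eps_kp=5.0, eps_d=1.0):
    """Share of keypoints matched consistently in image and descriptor space.

    `warped_kp1` holds image-1 keypoints already mapped into image 2.
    """
    warped_kp1 = np.asarray(warped_kp1, dtype=float)
    kp2 = np.asarray(kp2, dtype=float)
    desc1 = np.asarray(desc1, dtype=float)
    desc2 = np.asarray(desc2, dtype=float)
    if len(warped_kp1) == 0 or len(kp2) == 0:
        return 0.0
    d_img = np.linalg.norm(warped_kp1[:, None, :] - kp2[None, :, :], axis=2)
    d_desc = np.linalg.norm(desc1[:, None, :] - desc2[None, :, :], axis=2)
    agreed = mutual_nn_pairs(d_img, eps_kp) & mutual_nn_pairs(d_desc, eps_d)
    return len(agreed) / min(len(warped_kp1), len(kp2))
```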
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces ELF, a feature detector that extracts keypoint locations from the gradients of feature maps in a pre-trained CNN (trained on standard tasks like classification) to produce a saliency map whose local maxima serve as keypoints. No supervised training or finetuning is required. It reports repeatability and matching scores on HPatches comparable to specialized detectors, plus robustness to illumination and viewpoint changes on Webcam and photo-tourism images, concluding that location information is embedded in standard CNNs at a level relevant for feature detection.
Significance. If validated with rigorous controls, the result would indicate that gradient-based saliency from classification CNNs can yield repeatable keypoints without task-specific training, potentially allowing reuse of existing models for detection and reducing the need for dedicated feature-detection networks.
major comments (3)
- [Abstract / method] The central claim that gradients of an intermediate feature map produce saliency maxima on 'relevant keypoint locations' (abstract) lacks supporting analysis of what the saliency responds to. Classification-trained gradients typically emphasize semantic or class-discriminative regions rather than viewpoint-invariant local structures such as corners; without layer-specific ablation or response characterization (e.g., edge vs. blob), the 'embedded localisation' interpretation remains unsupported.
- [Evaluation] Matchability is evaluated with a simple descriptor introduced solely for this paper. This choice prevents direct comparison to prior detectors that use standard or learned descriptors, undermining the claim of 'similar performances' on HPatches.
- [Experiments] No control experiment is described that isolates the contribution of the gradient step (e.g., showing that the same CNN without gradient-based localization fails at repeatable detection). This is required to establish that the observed scores arise from embedded location information rather than incidental properties of the feature maps.
minor comments (2)
- [Abstract] The abstract and evaluation sections should explicitly state the CNN architecture, layer index, and exact procedure (thresholding, non-maximum suppression) used to extract local maxima from the saliency map.
- [Evaluation] Quantitative tables with error bars, number of images, and statistical significance tests are needed to support the 'comparable' and 'robust' claims; their absence makes the reported scores difficult to interpret.
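One way to produce the error bars requested in the last comment is a percentile bootstrap over per-image-pair scores. The sketch below is a generic statistical recipe, not something the paper is known to use; the function name and defaults are hypothetical, and any scores passed to it would have to come from an actual evaluation run.

```python
import numpy as np

def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Mean of `values` with a percentile-bootstrap (1 - alpha) interval."""
    values = np.asarray(values, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample image pairs with replacement and recompute the mean each time.
    idx = rng.integers(0, len(values), size=(n_resamples, len(values)))
    means = values[idx].mean(axis=1)
    low, high = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(values.mean()), float(low), float(high)
```

Reporting the interval alongside the number of image pairs makes "comparable" claims between detectors easier to judge than a single mean.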
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.
-
Referee: [Abstract / method] The central claim that gradients of an intermediate feature map produce saliency maxima on 'relevant keypoint locations' (abstract) lacks supporting analysis of what the saliency responds to. Classification-trained gradients typically emphasize semantic or class-discriminative regions rather than viewpoint-invariant local structures such as corners; without layer-specific ablation or response characterization (e.g., edge vs. blob), the 'embedded localisation' interpretation remains unsupported.
Authors: We agree that additional characterization of the saliency response would strengthen the interpretation of the results. In the revised manuscript we will add layer-specific ablations (showing repeatability across different convolutional layers) together with qualitative examples and quantitative measures of the types of structures (edges, corners, blobs) to which the gradient-based saliency responds on the evaluation datasets. revision: yes
-
Referee: [Evaluation] Matchability is evaluated with a simple descriptor introduced solely for this paper. This choice prevents direct comparison to prior detectors that use standard or learned descriptors, undermining the claim of 'similar performances' on HPatches.
Authors: The simple descriptor was introduced to isolate detector performance from descriptor quality. We acknowledge that this limits direct comparability with prior work. In the revision we will report additional matching scores on HPatches using a standard hand-crafted descriptor (SIFT) and, where feasible, a learned descriptor, allowing readers to compare ELF keypoints against published detector-descriptor combinations. revision: yes
-
Referee: [Experiments] No control experiment is described that isolates the contribution of the gradient step (e.g., showing that the same CNN without gradient-based localization fails at repeatable detection). This is required to establish that the observed scores arise from embedded location information rather than incidental properties of the feature maps.
Authors: We will add a control experiment that replaces the gradient-based saliency map with the raw feature-map activations (or their spatial maximum) as the keypoint source, keeping the same CNN and post-processing. This will quantify the specific contribution of the gradient computation to repeatability and matchability. revision: yes
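The control described in this response could start from an activation-only saliency map such as the one sketched below. The specifics are assumptions: the raw feature map is collapsed with a channel-wise L2 norm and upsampled to image resolution by nearest-neighbour repetition, after which the same keypoint post-processing would be applied. The rebuttal itself only says "raw feature-map activations (or their spatial maximum)", so other reductions are equally plausible.

```python
import numpy as np

def activation_saliency(features, image_shape):
    """Gradient-free baseline: channel-wise L2 norm of a (C, h, w) feature
    map, upsampled to `image_shape` by nearest-neighbour repetition."""
    _, h, w = features.shape
    out_h, out_w = image_shape
    energy = np.linalg.norm(features, axis=0)   # shape (h, w)
    rows = np.arange(out_h) * h // out_h        # source row for each output row
    cols = np.arange(out_w) * w // out_w        # source column for each output column
    return energy[np.ix_(rows, cols)]
```

Comparing keypoints from this map with those from the gradient-based map, under identical thresholding and suppression, would isolate what the gradient step contributes.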
Circularity Check
No circularity; empirical method and evaluation are self-contained
The paper presents an empirical detector that computes saliency from gradients of a pre-trained CNN feature map and evaluates repeatability/matching on HPatches, Webcam, and photo-tourism data. No equations, parameter fits, self-citations, or uniqueness theorems are invoked that would reduce the central claim (embedded location information is comparably relevant) to a definition or input quantity by construction. The derivation chain consists of a proposed extraction procedure plus external dataset benchmarks, with no load-bearing steps that collapse into the method's own choices.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: the gradient of the feature map with respect to the input image yields a saliency map with local maxima at relevant keypoints.
Reference graph
Works this paper leans on
- [1] CVPR19 Image Matching Challenge. https://image-matching-workshop.github.io/challenge/, 2019.
- [2] Aanæs, H., Dahl, A. L., and Pedersen, K. S. Interesting interest points. International Journal of Computer Vision 97, 1 (2012), 18–35.
- [3] Abadi, M., Barham, P., Chen, J., Chen, Z., Davis, A., Dean, J., Devin, M., Ghemawat, S., Irving, G., Isard, M., et al. TensorFlow: a system for large-scale machine learning. In OSDI (2016), vol. 16, pp. 265–283.
- [4] Alcantarilla, P. F., Bartoli, A., and Davison, A. J. KAZE features. In European Conference on Computer Vision (2012), Springer, pp. 214–227.
- [5] Balntas, V., Lenc, K., Vedaldi, A., and Mikolajczyk, K. HPatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Conference on Computer Vision and Pattern Recognition (CVPR) (2017), vol. 4, p. 6.
- [6] Balntas, V., Riba, E., Ponsa, D., and Mikolajczyk, K. Learning local feature descriptors with triplets and shallow convolutional neural networks. In BMVC (2016), vol. 1, p. 3.
- [7] Bay, H., Tuytelaars, T., and Van Gool, L. SURF: Speeded up robust features. In European Conference on Computer Vision (2006), Springer, pp. 404–417.
- [8] Calonder, M., Lepetit, V., Strecha, C., and Fua, P. BRIEF: Binary robust independent elementary features. In European Conference on Computer Vision (2010), Springer, pp. 778–792.
- [9] Chollet, F. Xception: Deep learning with depthwise separable convolutions. In 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017 (2017), pp. 1800–1807.
- [10] Choy, C. B., Gwak, J., Savarese, S., and Chandraker, M. Universal correspondence network. In Advances in Neural Information Processing Systems (2016), pp. 2414–2422.
- [11] Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. ImageNet: A large-scale hierarchical image database. In Computer Vision and Pattern Recognition, 2009. CVPR 2009. IEEE Conference on (2009), IEEE, pp. 248–255.
- [12] DeTone, D., Malisiewicz, T., and Rabinovich, A. SuperPoint: Self-supervised interest point detection and description. In CVPR Deep Learning for Visual SLAM Workshop (2018).
- [13] Fischer, P., Dosovitskiy, A., and Brox, T. Descriptor matching with convolutional neural networks: a comparison to SIFT. arXiv preprint arXiv:1405.5769 (2014).
- [14] Gatys, L. A., Ecker, A. S., and Bethge, M. Image style transfer using convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 2414–2423.
- [15] Han, X., Leung, T., Jia, Y., Sukthankar, R., and Berg, A. C. MatchNet: Unifying feature and metric learning for patch-based matching. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3279–3286.
- [16] Heinly, J., Schonberger, J. L., Dunn, E., and Frahm, J.-M. Reconstructing the world* in six days *(as captured by the Yahoo 100 million image dataset). In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 3287–3295.
- [17] Kapur, J. N., Sahoo, P. K., and Wong, A. K. A new method for gray-level picture thresholding using the entropy of the histogram. Computer Vision, Graphics, and Image Processing 29, 3 (1985), 273–285.
- [18] Krizhevsky, A., Sutskever, I., and Hinton, G. E. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems (2012), pp. 1097–1105.
- [19] Lenc, K., Gulshan, V., and Vedaldi, A. VLBenchmarks. http://www.vlfeat.org/benchmarks/xsxs, 2011.
- [20] Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. Microsoft COCO: Common objects in context. In European Conference on Computer Vision (2014), Springer, pp. 740–755.
- [21] Lowe, D. G. Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60, 2 (2004), 91–110.
- [22] Mahendran, A., and Vedaldi, A. Understanding deep image representations by inverting them. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5188–5196.
- [23] Matas, J., Chum, O., Urban, M., and Pajdla, T. Robust wide-baseline stereo from maximally stable extremal regions. Image and Vision Computing 22, 10 (2004), 761–767.
- [24] Melekhov, I., Kannala, J., and Rahtu, E. Siamese network features for image matching. In 2016 23rd International Conference on Pattern Recognition (ICPR) (2016), IEEE, pp. 378–383.
- [25] Mikolajczyk, K., and Schmid, C. A performance evaluation of local descriptors. IEEE Transactions on Pattern Analysis and Machine Intelligence 27, 10 (2005), 1615–1630.
- [26] Mikolajczyk, K., Tuytelaars, T., Schmid, C., Zisserman, A., Matas, J., Schaffalitzky, F., Kadir, T., and Van Gool, L. A comparison of affine region detectors. International Journal of Computer Vision 65, 1-2 (2005), 43–72.
- [27] Noh, H., Araujo, A., Sim, J., Weyand, T., and Han, B. Large-scale image retrieval with attentive deep local features. In Proceedings of the IEEE International Conference on Computer Vision (2017), pp. 3456–3465.
- [28] Ono, Y., Trulls, E., Fua, P., and Yi, K. M. LF-Net: Learning local features from images. In Advances in Neural Information Processing Systems (2018).
- [29] Rosten, E., and Drummond, T. Machine learning for high-speed corner detection. In European Conference on Computer Vision (2006), Springer, pp. 430–443.
- [30] Rublee, E., Rabaud, V., Konolige, K., and Bradski, G. ORB: An efficient alternative to SIFT or SURF. In Computer Vision (ICCV), 2011 IEEE International Conference on (2011), IEEE, pp. 2564–2571.
- [31] Savinov, N., Seki, A., Ladicky, L., Sattler, T., and Pollefeys, M. Quad-networks: unsupervised learning to rank for interest point detection. In Proc. IEEE Conference on Computer Vision and Pattern Recognition (CVPR) (2017).
- [32] Schonberger, J. L., and Frahm, J.-M. Structure-from-motion revisited. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2016), pp. 4104–4113.
- [33] Selvaraju, R. R., Cogswell, M., Das, A., Vedantam, R., Parikh, D., Batra, D., et al. Grad-CAM: Visual explanations from deep networks via gradient-based localization. In ICCV (2017), pp. 618–626.
- [34] Simo-Serra, E., Trulls, E., Ferraz, L., Kokkinos, I., Fua, P., and Moreno-Noguer, F. Discriminative learning of deep convolutional feature point descriptors. In Proceedings of the IEEE International Conference on Computer Vision (2015), pp. 118–126.
- [35] Simonyan, K., Vedaldi, A., and Zisserman, A. Deep inside convolutional networks: Visualising image classification models and saliency maps. arXiv preprint arXiv:1312.6034 (2013).
- [36] Simonyan, K., and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556 (2014).
- [37] Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. SmoothGrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825 (2017).
- [38] Springenberg, J., Dosovitskiy, A., Brox, T., and Riedmiller, M. Striving for simplicity: The all convolutional net. In ICLR (workshop track) (2015).
- [39] Strecha, C., Von Hansen, W., Van Gool, L., Fua, P., and Thoennessen, U. On benchmarking camera calibration and multi-view stereo for high resolution imagery. In Computer Vision and Pattern Recognition, 2008. CVPR 2008. IEEE Conference on (2008), IEEE, pp. 1–8.
- [40] Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning (2017), pp. 3319–3328.
- [41] Taira, H., Okutomi, M., Sattler, T., Cimpoi, M., Pollefeys, M., Sivic, J., Pajdla, T., and Torii, A. InLoc: Indoor visual localization with dense matching and view synthesis. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018), pp. 7199–7209.
- [42] Thomee, B., Shamma, D. A., Friedland, G., Elizalde, B., Ni, K., Poland, D., Borth, D., and Li, L.-J. YFCC100M: The new data in multimedia research. Communications of the ACM 59, 2, 64–73.
- [43] Verdie, Y., Yi, K., Fua, P., and Lepetit, V. TILDE: A temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 5279–5288.
- [44] Yi, K. M., Trulls, E., Lepetit, V., and Fua, P. LIFT: Learned invariant feature transform. In European Conference on Computer Vision (2016), Springer, pp. 467–483.
- [45] Zagoruyko, S., and Komodakis, N. Learning to compare image patches via convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2015), pp. 4353–4361.
- [46] Zeiler, M. D., and Fergus, R. Visualizing and understanding convolutional networks. In European Conference on Computer Vision (2014), Springer, pp. 818–833.
A Metrics definition
We make explicit the repeatability and matching score definitions introduced in [26] and our adaptations, using the following notation: let (I1, I2) be a pair of images and ...
... and LF-Net [28].
Matching score. The matching score definition introduced in [26] captures the percentage of keypoint pairs that are nearest neighbours both in image space and in descriptor space, and for which these two distances are below their respective thresholds ϵ_kp and ϵ_d. Let M be defined as in the previous paragraph and M_d be the analog of M when the ...
Detector         Repeatability [5]   Repeatability [43]   Matching score [5]   Matching score [43]
ELF-VGG          63.81               53.23                51.84                43.73
ELF-AlexNet      51.30               38.54                35.21                31.92
ELF-Xception     48.06               49.84                29.81                35.48
ELF-SuperPoint   59.7                46.29                44.32                18.11
ELF-LFNet        60.1                41.90                44.56                33.43
LF-Net           61.16               48.27                34.19                18.10
SuperPoint       68.57               46.35                57.11                32.44
LIFT             54.66               42.21                34.02                17.83
SURF             54.51               33.93                26.10                10.13
SIFT             51.19               28.25                24.58                8.30
ORB              53.44               31.56                ...
... Robustness to light (Webcam [43]) ...
Dataset        Variant                LF-Net   SuperPoint   LIFT    SIFT    SURF    ORB
HPatches [5]   Original description   34.19    57.11        34.02   24.58   26.10   14.76
HPatches [5]   Simple-descriptor      44.19    53.71        39.48   27.03   34.97   20.04
Webcam [43]    Original description   18.10    32.44        17.83   10.13   8.30    1.28
Webcam [43]    Simple-descriptor      30.71    34.60        26.84   13.21   21.43   13.91
Table 2: Individual component performance (Fig. 9-stripes). Matching score for the integration of the VGG pool3 simple-descriptor with others' detection. Top: original description. Bottom: integration of simple-descriptor. HPatches: [5]. Webcam: [43].

Dataset        Variant              LF-Net   SuperPoint   LIFT    SIFT    SURF    ORB
HPatches [5]   Original detection   34.19    57.11        34.02   24.58   26.10   14.76
HPatches [5]   ELF detection        39.16    54.44        42.48   50.63   30.91   36.96
Webcam [43]    Original detection   18.10    32.44        17.83   10.13   8.30    1.28
Webcam [43]    ELF detection        26.70    39.55        30.82   36.83   19.14   6.60
Table 3: Individual component performance (Fig. 9-circle). Matching score for the integration of ELF-VGG (on pool2) with others' descriptors. Top: original detection. Bottom: integration of ELF. HPatches: [5]. Webcam: [43].

Detector         Repeatability [5]   Repeatability [43]   Matching score [5]   Matching score [43]
Sobel-VGG        56.99               33.74                42.11                20.99
Lapl.-VGG        65.45               33.74                55.25                22.79
VGG              63.81               53.23                51.84                43.73
Sobel-AlexNet    56.44               33.74                30.57                15.42
Lapl.-AlexNet    65.93               33.74                40.92                15.42
AlexNet          51.30               38.54                35.21                31.92
Sobel-Xception   56.44               33.74                34.14                16.86
Lapl.-Xception   65.93               33.74                42.52                16.86
Xception         48.06               49.84                29.81                35.48
Table 4: Gradient baseline o...
... (Fig. 10).
C ELF Meta Parameters
This section specifies the meta-parameter values for the ELF variants. For all methods, (w_NMS, b_NMS) = (10, 10).
- Denoise: (µ_noise, σ_noise).
- Threshold: (µ_thr, σ_thr).
- F^l: the feature map whose gradient is used for detection.
- simple-des: the feature map used for simple-description.
Unless mentioned otherwise, ...