Unsupervised Learning Framework of Interest Point Via Properties Optimization

Cai Wen; Pei Yan; Yihua Tan; Yuan Tai; Yuan Xiao

arxiv: 1907.11375 · v1 · pith:BDUU2CXYnew · submitted 2019-07-26 · 💻 cs.CV

Pei Yan, Yihua Tan, Yuan Xiao, Yuan Tai, Cai Wen

Pith reviewed 2026-05-24 16:09 UTC · model grok-4.3

classification 💻 cs.CV

keywords unsupervised learning · interest point detection · feature descriptor · expectation maximization · image matching · property optimization · fully convolutional network

The pith

An unsupervised framework jointly trains interest point detectors and descriptors by optimizing sparsity, repeatability and discriminability as joint probabilities via EM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that interest points can be learned entirely without labels by defining their key properties as probabilities and maximizing their joint distribution with a latent-variable EM procedure. A sympathetic reader would care because supervised training of detectors requires expensive point annotations that often fail to transfer across domains, while this approach claims to produce a single network that works on diverse scenes out of the box. The method instantiates both the detector and descriptor inside a fully convolutional Property Network and shows it exceeds prior art on standard matching benchmarks. The optimization uses a mini-batch approximation of EM to keep computation feasible on large image collections.

Core claim

By treating sparsity, repeatability and discriminability as a joint probability over extracted points and introducing a latent variable for the probability that any given point satisfies the required properties, the training objective can be maximized with the EM algorithm; the resulting Property Network, implemented as fully convolutional networks, produces detectors and descriptors that outperform state-of-the-art methods on multiple image matching benchmarks without any retraining or labeled data.

What carries the argument

Joint probability distribution over sparsity, repeatability and discriminability, maximized by latent-variable Expectation Maximization with mini-batch approximation.
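For orientation, the generic latent-variable EM iteration that this kind of objective builds on can be written as below. This is the textbook form, not the paper's specific objective: the symbols $x$ (observed images), $z$ (latent indicator of whether a point satisfies the required properties) and $\theta$ (network parameters) are placeholders assumed here for illustration.

```latex
\begin{aligned}
\text{E-step:}\quad & q^{(t)}(z) = p\bigl(z \mid x, \theta^{(t)}\bigr) \\
\text{M-step:}\quad & \theta^{(t+1)} = \arg\max_{\theta}\; \mathbb{E}_{z \sim q^{(t)}}\bigl[\log p(x, z \mid \theta)\bigr]
\end{aligned}
```

In a mini-batch approximation, both steps would be evaluated on a sampled subset of images at each iteration rather than on the full collection.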

If this is right

Detector and descriptor can be learned jointly in a single unsupervised pass without ground-truth correspondences.
The same network generalizes to new scene types without retraining or fine-tuning.
Mini-batch EM makes the probabilistic objective tractable for large unlabeled image collections.
The framework can instantiate different network architectures while keeping the same property-based objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the probabilistic formulation holds, explicit supervision may be unnecessary once the right properties are encoded as probabilities rather than as hand-labeled points.
The approach could be tested on tasks beyond rigid matching, such as non-rigid deformation or low-texture environments, to see whether the same three properties suffice.
Replacing the fully convolutional backbone with other architectures would test whether the gains come mainly from the objective or from the network choice.

Load-bearing premise

That the three properties can be expressed as joint probabilities whose maximization through EM produces points that reliably correspond across real image transformations.

What would settle it

Evaluating the trained Property Network on HPatches or a similar benchmark and finding that its matching or repeatability scores fall below those of a standard baseline, such as the handcrafted SIFT or the self-supervised SuperPoint.

Figures

Figures reproduced from arXiv: 1907.11375 by Cai Wen, Pei Yan, Yihua Tan, Yuan Tai, Yuan Xiao.

**Figure 1.** Overview of unsupervised training framework via optimizing the properties of interest point. It consists of three parts: (1) training …

**Figure 2.** Visual matching results of state-of-the-art algorithms and our PN-i-64 model (M-score indicates Matching Score, and Homo-error …

**Figure 3.** Visual matching results of Superpoint and different Property Networks. First col: PN-i-64 model, second col: PN-v-64 model, …

**Figure 4.** Overview of Property Network. (Panel labels extracted with this figure: Original Image, Homography, Blur, Channels Shuffle, Contrast Normalization, Grayscale, Invert, Salt and Pepper, Shadow.)

**Figure 5.** Examples of simulations for illumination and viewpoint

**Figure 6.** Visual matching results of state-of-the-art algorithms and our PN-i-64 model (M-score indicates Matching Score, and Homo-error …

Original abstract

This paper presents an entirely unsupervised interest point training framework by jointly learning detector and descriptor, which takes an image as input and outputs a probability and a description for every image point. The objective of the training framework is formulated as joint probability distribution of the properties of the extracted points. The essential properties are selected as sparsity, repeatability and discriminability which are formulated by the probabilities. To maximize the objective efficiently, latent variable is introduced to represent the probability of that a point satisfies the required properties. Therefore, original maximization can be optimized with Expectation Maximization algorithm (EM). Considering high computation cost of EM on large scale image set, we implement the optimization process with an efficient strategy as Mini-Batch approximation of EM (MBEM). In the experiments both detector and descriptor are instantiated with fully convolutional network which is named as Property Network (PN). The experiments demonstrate that PN outperforms state-of-the-art methods on a number of image matching benchmarks without need of retraining. PN also reveals that the proposed training framework has high flexibility to adapt to diverse types of scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This paper proposes an entirely unsupervised framework for jointly learning interest point detectors and descriptors. An image is input to a network that outputs a probability and descriptor for every point; the training objective is defined as the joint probability distribution over three properties (sparsity, repeatability, discriminability) expressed as probabilities. A latent variable is introduced to represent whether a point satisfies the properties, allowing optimization via the EM algorithm, which is approximated by mini-batch EM (MBEM) for scalability. Both components are realized as a fully convolutional Property Network (PN) that is reported to outperform prior methods on image-matching benchmarks without retraining and to adapt to diverse scenes.

Significance. If the probability formulations and latent-variable EM construction can be shown to produce detectors and descriptors that genuinely encode repeatability and discriminability without supervision or known correspondences, the result would be significant: it would remove the requirement for labeled data or synthetic warps that currently limit supervised interest-point methods and could improve cross-scene generalization.

major comments (2)

[Abstract] The central claim rests on the definitions of the three properties as joint probabilities and on the latent-variable EM step (via MBEM) actually enforcing repeatability and discriminability. The abstract supplies no equations for these probabilities, so it is impossible to verify whether the objective admits trivial solutions (e.g., uniform or edge-only selections) or whether the unsupervised proxy truly substitutes for cross-image correspondences.
[Abstract] The performance claim that PN outperforms state-of-the-art methods on multiple benchmarks without retraining is load-bearing for the contribution. No quantitative results, benchmark names, or comparison tables are supplied, preventing assessment of whether the reported gains are statistically meaningful or whether the baselines were evaluated under identical conditions.

minor comments (1)

The description of the fully convolutional architecture for simultaneous detection and description could be expanded with layer counts, receptive-field sizes, and how the probability and descriptor heads share features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, with proposed revisions to strengthen the abstract while preserving its conciseness.

Referee: [Abstract] The central claim rests on the definitions of the three properties as joint probabilities and on the latent-variable EM step (via MBEM) actually enforcing repeatability and discriminability. The abstract supplies no equations for these probabilities, so it is impossible to verify whether the objective admits trivial solutions (e.g., uniform or edge-only selections) or whether the unsupervised proxy truly substitutes for cross-image correspondences.

Authors: The abstract is intentionally concise. The probability formulations for sparsity (modeled as a Bernoulli process encouraging sparse selection), repeatability (probability of consistent detection under transformations), and discriminability (probability of unique descriptors), along with the latent variable z indicating property satisfaction and the MBEM optimization, are fully derived in Sections 3.1–3.3. The joint objective is constructed to preclude trivial solutions because the discriminability term penalizes descriptor collisions across points, and the EM procedure alternates between inferring latent assignments and maximizing the expected joint probability. We will revise the abstract to explicitly name the three properties and reference the EM-based optimization to better convey the unsupervised proxy for correspondences. revision: partial
Referee: [Abstract] The performance claim that PN outperforms state-of-the-art methods on multiple benchmarks without retraining is load-bearing for the contribution. No quantitative results, benchmark names, or comparison tables are supplied, preventing assessment of whether the reported gains are statistically meaningful or whether the baselines were evaluated under identical conditions.

Authors: We agree that greater specificity would strengthen the abstract. Section 4 presents quantitative results with tables comparing PN against SuperPoint, LIFT, and other baselines on HPatches, Oxford Affine, and additional matching benchmarks, using identical evaluation protocols (e.g., matching score at 5-pixel threshold) and reporting mean improvements without any retraining on the test sets. We will revise the abstract to name the primary benchmarks and indicate that detailed tables appear in the experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: objective defined from external properties and optimized via standard EM

The derivation defines the training objective externally as the joint probability over three standard interest-point properties (sparsity, repeatability, discriminability) and applies latent-variable EM (MBEM) to maximize it. No equation or step reduces the target to a fitted parameter or self-citation by construction; the unsupervised proxy is motivated independently of the final benchmark performance, and the network instantiation is a conventional FCN. The central claim therefore rests on the empirical transfer of the learned detector/descriptor rather than on any definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that the three listed properties can be expressed as probabilities suitable for EM; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Interest points possess the essential properties of sparsity, repeatability and discriminability, which can be formulated as probabilities.
Directly stated in the abstract as the basis for the objective.

Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

[1] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak: Fast retina keypoint. In IEEE Conference on Computer Vision & Pattern Recognition, pages 510–517, 2012.

[2] Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J Davison. Kaze features. In European Conference on Computer Vision, pages 214–227. Springer, 2012.

[3] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017.

[4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.

[5] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010.

[6] Titus Cieslewski, Michael Bloesch, and Davide Scaramuzza. Matching features without descriptors: Implicitly matched interest points (imips). arXiv preprint arXiv:1811.10681, 2018.

[7] Titus Cieslewski and Davide Scaramuzza. Sips: unsupervised succinct interest points. arXiv preprint arXiv:1805.01358, 2018.

[8] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016.

[9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.

[10] Christopher G Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988.

[11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[13] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. Brisk: Binary robust invariant scalable keypoints. In 2011 IEEE international conference on computer vision (ICCV), pages 2548–2555. IEEE, 2011.

[14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.

[15] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.

[16] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.

[17] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE transactions on pattern analysis and machine intelligence, 27(10):1615–1630, 2005.

[18] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A comparison of affine region detectors. International journal of computer vision, 65(1-2):43–72, 2005.

[19] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: learning local features from images. In Advances in Neural Information Processing Systems, pages 6237–6247, 2018.

[20] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European conference on computer vision, pages 430–443. Springer, 2006.

[21] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1822–1830, 2017.

[22] Cordelia Schmid, Roger Mohr, and Christian Bauckhage. Evaluation of interest point detectors. International Journal of computer vision, 37(2):151–172, 2000.

[23] Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua. Training for task specific keypoint detection. In Joint Pattern Recognition Symposium, pages 151–160. Springer, 2009.

[24] Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. Tilde: a temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5279–5288, 2015.

[25] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In IEEE Conference on Computer Vision & Pattern Recognition, pages 5028–5037, 2017.

[26] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.

[27] Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. 2010.

Supplementary Material. In this supplementary material we give more details of our implementation and experimental results. In section 1 we implement the Expectation Maximization (EM) process with an efficient strategy called Mini-Batch approximation of EM (M...
[28] Mini-Batch Approximation of Expectation Maximization. 1.1. Problems of Original Expectation Maximization. Before discussing the difficulties of optimizing our objective with the original Expectation Maximization (EM) algorithm, we first review the objective of our training framework, which is formulated as $\arg\max_{\theta_F,\theta_D,y_1,\ldots,y_K} P(\theta_F,\theta_D,y_1,\ldots,y_K) = \prod_k r(y_k) \cdot \hat{c}($...

[29] The definition of sample space $Y$ is integrated with the sparsity constraint. It is easy to check whether a specific $y^{(m)}$ satisfies the sparsity constraint, but it is difficult to directly obtain all $y^{(m)}$ that satisfy it.

[30] To solve for the distribution $p_i$, we must consider all points because $\hat{c}_i$ depends on the vector $y$. Since it is hard to solve (41) precisely, this subsection introduces an efficient strategy to approximate $p_i$, which consists of three steps. First, we approximate the sample space $Y$ with $\hat{Y}$, which can be obtained efficiently. Second, $\hat{c}_i(y)$ is approximated with $\hat{c}_i(\hat{y})$ wh...

[31] Architecture of Property Network. Property Network (PN) is a specific implementation of our training framework. In PN, both detector and descriptor are implemented with Fully Convolutional Networks [15, 27], whose architecture is inspired by [9]. Figure 4 briefly shows the architecture of PN. The detector comprises an encoder and a detection decoder, and descript...

[32] Simulations for Viewpoint and Illumination Changes. In this section we outline our simulations for illumination and viewpoint changes. For illumination simulation we randomly select the transformations that change pixel values. For viewpoint simulation we randomly generate homography matrices used to perform homography transformations. Figure 5 shows sever...
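The viewpoint simulation described above (randomly generated homography matrices) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the corner-jitter sampling scheme, the `max_shift` parameter and all function names are assumptions introduced here, since the extracted text does not say how the matrices are sampled.

```python
import numpy as np

def homography_from_points(src, dst):
    """Solve for the 3x3 homography H (with H[2, 2] fixed to 1) mapping src to dst.

    src, dst: arrays of shape (4, 2) holding (x, y) coordinates.
    """
    rows, rhs = [], []
    for (x, y), (u, v) in zip(src, dst):
        # u = (h1*x + h2*y + h3) / (h7*x + h8*y + 1), and likewise for v
        rows.append([x, y, 1, 0, 0, 0, -u * x, -u * y])
        rows.append([0, 0, 0, x, y, 1, -v * x, -v * y])
        rhs.extend([u, v])
    h = np.linalg.solve(np.array(rows, dtype=float), np.array(rhs, dtype=float))
    return np.append(h, 1.0).reshape(3, 3)

def random_homography(width, height, max_shift=0.15, rng=None):
    """Sample a homography by jittering the four image corners (assumed scheme)."""
    rng = np.random.default_rng() if rng is None else rng
    src = np.array([[0, 0], [width, 0], [width, height], [0, height]], dtype=float)
    jitter = rng.uniform(-max_shift, max_shift, size=(4, 2)) * np.array([width, height])
    dst = src + jitter
    return homography_from_points(src, dst), src, dst

def warp_points(H, pts):
    """Apply H to an (N, 2) array of (x, y) points using homogeneous coordinates."""
    pts_h = np.hstack([pts, np.ones((len(pts), 1))])
    out = pts_h @ H.T
    return out[:, :2] / out[:, 2:3]
```

Because the homography is known by construction, any point in the original image can be mapped to its location in the warped image with `warp_points`, which is what makes correspondences available without manual labels.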
[33] Image Blur. Apply Gaussian blur, average blur or median blur on the image.

[34] Channels Shuffle. Permute the order of the color channels of the image.

[35] Contrast Normalization. Change the contrast of the image by moving pixel values away from or closer to 128.

[36] Grayscale. Convert the image to grayscale and mix it with the original image using a random weight.

[37] Invert all pixels in the given image, i.e. set them to 255 − original pixel value.

[38] Salt and Pepper noise. Randomly replace some pixels with very white or black colors.

[39] Shadow. Randomly insert some dark shapes into the image. In the main text we mentioned that different PN models are trained with different levels of simulation. PN-v-64 is trained with only Image Blur, Contrast Normalization and Shadow, and PN-i-64 and PN-128 are trained with all kinds of simulations for illumination changes. In simulations for viewpoint changes, ...
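Several of the pixel-level illumination transformations listed above can be sketched in a few lines of NumPy. This is an illustrative sketch under stated assumptions, not the authors' code: the function names, the channel-average grayscale conversion, and the explicit `alpha`, `weight` and `fraction` parameters are all hypothetical, because the extracted text does not give sampling ranges. Image Blur and Shadow are omitted.

```python
import numpy as np

def channels_shuffle(img, rng):
    """Permute the order of the color channels (last axis)."""
    return img[..., rng.permutation(img.shape[-1])]

def contrast_normalization(img, alpha):
    """Move pixel values away from 128 (alpha > 1) or closer to it (alpha < 1)."""
    out = 128.0 + alpha * (img.astype(np.float64) - 128.0)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def grayscale_blend(img, weight):
    """Mix a grayscale version of the image with the original.

    weight = 0 keeps the original, weight = 1 gives full grayscale.
    Grayscale is a plain channel average here (an assumption).
    """
    gray = img.astype(np.float64).mean(axis=-1, keepdims=True)
    out = weight * gray + (1.0 - weight) * img
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)

def invert(img):
    """Set every pixel to 255 minus its original value (uint8 input)."""
    return 255 - img

def salt_and_pepper(img, fraction, rng):
    """Replace roughly `fraction` of the pixels with pure white or pure black."""
    out = img.copy()
    mask = rng.random(img.shape[:2]) < fraction
    salt = rng.random(img.shape[:2]) < 0.5
    out[mask & salt] = 255
    out[mask & ~salt] = 0
    return out
```

All functions expect an `(H, W, 3)` `uint8` array; a training pipeline would presumably draw a random subset of them, with random parameters, for each image.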
[40] Performance Metrics. The two metrics used in our experiments are Matching Score and Homography Estimation, which are identical to those used in [9]. In this section we introduce their definitions. 4.1. Matching Score. Matching Score measures the overall performance of the interest point detector and descriptor. It measures the ratio of ground truth correspondences t...

[41] More Experimental Results. In the main text we demonstrated results on image size 640 × 480. In this subsection we give results of different methods on image size 320 × 240. All experiment configurations are the same as those for image size 640 × 480 except the maximum number of extracted interest points. We keep no more than 300 interest points for 320 × 240 images....
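The cap on the number of extracted points mentioned above can be illustrated with a simple top-k selection over a per-pixel probability map. This is a sketch under assumptions: the abstract states that the detector outputs a probability for every image point, but the function name `select_top_k` is hypothetical, and the source does not say whether any non-maximum suppression precedes the cap.

```python
import numpy as np

def select_top_k(prob_map, k):
    """Return (row, col) coordinates and scores of the k highest-probability pixels.

    prob_map: 2-D array of per-pixel detection probabilities.
    Results are sorted by descending score.
    """
    flat = prob_map.ravel()
    k = min(k, flat.size)
    if k <= 0:
        return np.empty((0, 2), dtype=int), np.empty(0, dtype=flat.dtype)
    # argpartition finds the k largest entries without sorting the whole map
    idx = np.argpartition(flat, -k)[-k:]
    idx = idx[np.argsort(flat[idx])[::-1]]
    rows, cols = np.unravel_index(idx, prob_map.shape)
    return np.stack([rows, cols], axis=1), flat[idx]
```

For a 320 × 240 image, `select_top_k(prob_map, 300)` would keep at most 300 candidate points, matching the limit quoted in the text.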

[1] [1]

Freak: Fast retina keypoint

Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak: Fast retina keypoint. In IEEE Conference on Com- puter Vision & Pattern Recognition, pages 510–517, 2012

work page 2012

[2] [2]

Kaze features

Pablo Fern ´andez Alcantarilla, Adrien Bartoli, and Andrew J Davison. Kaze features. In European Conference on Com- puter Vision, pages 214–227. Springer, 2012

work page 2012

[3] [3]

Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors

Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krys- tian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017

work page 2017

[4] [4]

Surf: Speeded up robust features

Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. InEuropean conference on com- puter vision, pages 404–417. Springer, 2006

work page 2006

[5] [5]

Brief: Binary robust independent elementary features

Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010

work page 2010

[6] [6]

Matching features without descriptors: Implicitly matched interest points (imips)

Titus Cieslewski, Michael Bloesch, and Davide Scaramuzza. Matching features without descriptors: Implicitly matched interest points (imips). arXiv preprint arXiv:1811.10681 , 2018

work page arXiv 2018

[7] [7]

Sips: un- supervised succinct interest points

Titus Cieslewski and Davide Scaramuzza. Sips: un- supervised succinct interest points. arXiv preprint arXiv:1805.01358, 2018

work page arXiv 2018

[8] [8]

Group equivariant convo- lutional networks

Taco Cohen and Max Welling. Group equivariant convo- lutional networks. In International conference on machine learning, pages 2990–2999, 2016

work page 2016

[9] [9]

Superpoint: Self-supervised interest point detection and description

Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabi- novich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018

work page 2018

[10] [10]

A combined cor- ner and edge detector

Christopher G Harris, Mike Stephens, et al. A combined cor- ner and edge detector. InAlvey vision conference, volume 15, pages 10–5244. Citeseer, 1988

work page 1988

[11] [11]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal co- variate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015

[12] [12]

Adam: A Method for Stochastic Optimization

Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980 , 2014

work page internal anchor Pith review Pith/arXiv arXiv 2014

[13] [13]

Brisk: Binary robust invariant scalable keypoints

Stefan Leutenegger, Margarita Chli, and Roland Siegwart. Brisk: Binary robust invariant scalable keypoints. In 2011 IEEE international conference on computer vision (ICCV) , pages 2548–2555. Ieee, 2011

work page 2011

[14] [14]

Microsoft coco: Common objects in context

Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Doll´ar, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision , pages 740–755. Springer, 2014

work page 2014

[15] [15]

Fully convolutional networks for semantic segmentation

Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Pro- ceedings of the IEEE conference on computer vision and pat- tern recognition, pages 3431–3440, 2015

work page 2015

[16] [16]

Distinctive image features from scale- invariant keypoints

David G Lowe. Distinctive image features from scale- invariant keypoints. International journal of computer vi- sion, 60(2):91–110, 2004

work page 2004

[17] [17]

A performance evaluation of local descriptors

Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE transactions on pat- tern analysis and machine intelligence , 27(10):1615–1630, 2005

work page 2005

[18] [18]

A comparison of afﬁne region detectors

Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A comparison of afﬁne region detectors. International journal of computer vision , 65(1- 2):43–72, 2005

work page 2005

[19] [19]

Lf-net: learning local features from images

Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: learning local features from images. In Advances in Neural Information Processing Systems , pages 6237–6247, 2018

work page 2018

[20] [20]

Machine learning for high-speed corner detection

Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European conference on computer vision, pages 430–443. Springer, 2006

work page 2006

[21] [21]

Quad-networks: unsupervised learning to rank for interest point detection

Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sat- tler, and Marc Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. In Proceed- ings of the IEEE conference on computer vision and pattern recognition, pages 1822–1830, 2017

work page 2017

[22] [22]

Evaluation of interest point detectors

Cordelia Schmid, Roger Mohr, and Christian Bauckhage. Evaluation of interest point detectors. International Journal of computer vision, 37(2):151–172, 2000

work page 2000

[23] [23]

Training for task speciﬁc keypoint detection

Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua. Training for task speciﬁc keypoint detection. In Joint Pattern Recognition Symposium , pages 151–160. Springer, 2009

work page 2009

[24] [24]

Tilde: a temporally invariant learned detector

Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. Tilde: a temporally invariant learned detector. In Proceed- ings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5279–5288, 2015

work page 2015

[25] [25]

Worrall, Stephan J

Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukham- betov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. InIEEE Conference on Computer Vision & Pattern Recognition, pages 5028–5037, 2017

work page 2017

[26] [26]

Lift: Learned invariant feature transform

Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision , pages 467–483. Springer, 2016

work page 2016

[27] [27]

Deconvolutional networks

Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. 2010. Supplementary Material In this supplementary material we give more details of our implementation and experimential results. In section 1 we implement the Expectation Maximization (EM) process with an efﬁcient strategy called Mini-Batch approximation of EM (M...


Mini-Batch Approximation of Expectation Maximization

1.1. Problems of Original Expectation Maximization

Before discussing the difficulties of optimizing our objective with the original Expectation Maximization (EM) algorithm, we first review the objective of our training framework, which is formulated as

$\arg\max_{\theta_F,\theta_D,y_1,\ldots,y_K} P(\theta_F,\theta_D,y_1,\ldots,y_K) = \prod_k r(y_k) \cdot \hat{c}($...


The definition of the sample space $Y$ incorporates the sparsity constraint. It is easy to check whether a specific $y^{(m)}$ satisfies the sparsity constraint, but it is difficult to obtain directly every $y^{(m)}$ that satisfies it.


To solve for the distribution $p_i$, we must consider all points, because $\hat{c}_i$ depends on the vector $y$. Since it is hard to solve (41) exactly, this subsection introduces an efficient strategy for approximating $p_i$, which consists of three steps. First, we approximate the sample space $Y$ with $\hat{Y}$, which can be obtained efficiently. Second, $\hat{c}_i(y)$ is approximated with $\hat{c}_i(\hat{y})$ wh...


Architecture of Property Network

Property Network (PN) is a specific implementation of our training framework. In PN, both the detector and the descriptor are implemented as Fully Convolutional Networks [15, 27], with an architecture inspired by [9]. Figure 4 gives a brief overview of the PN architecture. The detector consists of an encoder and a detection decoder, and descript...


Simulations for Viewpoint and Illumination Changes

In this section we outline our simulations of illumination and viewpoint changes. For illumination simulation, we randomly select transformations that change pixel values. For viewpoint simulation, we randomly generate homography matrices and use them to apply homography transformations. Figure 5 shows sever...
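As an illustration of the viewpoint simulation, the following minimal Python sketch draws a random homography close to the identity and maps a pixel coordinate through it. The function names (`random_homography`, `warp_point`) and the perturbation ranges are hypothetical choices made for this example; the text does not state how the matrices are sampled.

```python
import random


def random_homography(rng, max_linear=0.1, max_shift=10.0, max_persp=1e-4):
    """Sample a 3x3 homography near the identity.

    All three ranges are illustrative assumptions, not values from the paper.
    """
    def u(a):
        return rng.uniform(-a, a)

    return [
        [1.0 + u(max_linear), u(max_linear), u(max_shift)],
        [u(max_linear), 1.0 + u(max_linear), u(max_shift)],
        [u(max_persp), u(max_persp), 1.0],
    ]


def warp_point(H, x, y):
    """Map (x, y) through H using homogeneous coordinates."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w


if __name__ == "__main__":
    H = random_homography(random.Random(0))
    print(warp_point(H, 320.0, 240.0))
```

Because the same matrix maps every pixel, a point detected in the original image has a known location in the warped image, which is what makes ground-truth correspondences available without labels.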


Image Blur. Apply Gaussian blur, average blur, or median blur to the image.


Channel Shuffle. Permute the order of the image's color channels.


Contrast Normalization. Change the contrast of the image by moving pixel values away from or closer to 128.


Grayscale. Convert the image to grayscale and mix it with the original image using a random weight.


Invert all pixels in the given image, i.e., set each pixel to 255 minus its original value.


Salt and Pepper Noise. Randomly replace some pixels with near-white or near-black colors.


Shadow. Randomly insert some dark shapes into the image.

As mentioned in the main text, different PN models are trained with different levels of simulation. PN-v-64 is trained with only Image Blur, Contrast Normalization and Shadow, whereas PN-i-64 and PN-128 are trained with all of the illumination simulations. In simulations for viewpoint changes, ...
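The pixel-wise illumination transformations listed in this section can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation: the image representation (rows of RGB tuples), the function names, the use of the channel mean as the gray value, and the exact values 0 and 255 for salt-and-pepper noise are all choices made for the example. Blur and Shadow are omitted for brevity.

```python
import random

# Assumed representation: an image is a list of rows, and each pixel is an
# (R, G, B) tuple of integers in [0, 255].


def clip(value):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(value))))


def contrast_normalization(img, alpha):
    """Move values away from 128 (alpha > 1) or closer to 128 (alpha < 1)."""
    return [[tuple(clip(128 + alpha * (c - 128)) for c in px) for px in row]
            for row in img]


def invert(img):
    """Set every channel value v to 255 - v."""
    return [[tuple(255 - c for c in px) for px in row] for row in img]


def channel_shuffle(img, rng):
    """Apply one random permutation of the channel order to the whole image."""
    order = [0, 1, 2]
    rng.shuffle(order)
    return [[tuple(px[i] for i in order) for px in row] for row in img]


def grayscale_mix(img, weight):
    """Blend each pixel with its gray value; weight=1.0 is fully gray."""
    out = []
    for row in img:
        new_row = []
        for px in row:
            gray = sum(px) / 3.0  # assumption: plain channel mean
            new_row.append(
                tuple(clip((1.0 - weight) * c + weight * gray) for c in px))
        out.append(new_row)
    return out


def salt_and_pepper(img, prob, rng):
    """Replace each pixel with black or white with probability prob."""
    out = []
    for row in img:
        new_row = []
        for px in row:
            if rng.random() < prob:
                v = rng.choice((0, 255))
                new_row.append((v, v, v))
            else:
                new_row.append(px)
        out.append(new_row)
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    image = [[(128, 28, 228), (30, 60, 90)]]
    print(contrast_normalization(image, 0.5))
    print(grayscale_mix(image, rng.random()))
```

Passing the random generator explicitly keeps each transformation reproducible, which is convenient when the same perturbation has to be replayed on an image pair.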


Performance Metrics

The two metrics used in our experiments are Matching Score and Homography Estimation, both identical to those used in [9]. In this section we introduce their definitions.

4.1. Matching Score

Matching Score measures the overall performance of the interest point detector and descriptor. It measures the ratio of ground truth correspondences t...


More Experimental Results

In the main text we demonstrated results on images of size 640 × 480. In this subsection we give results of the different methods on images of size 320 × 240. All experimental configurations are the same as for 640 × 480 except the maximum number of extracted interest points. We keep no more than 300 interest points for 320 × 240 images....
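The cap on the number of interest points can be pictured as a simple top-k selection. The sketch below assumes that each point carries a detector score and that the highest-scoring points are the ones kept; the text states only the limit of 300, so the ranking criterion and the name `keep_top_k` are assumptions made for illustration.

```python
def keep_top_k(points, k=300):
    """Return at most k points, ordered by descending score.

    points is a list of (x, y, score) tuples; ranking by score is an
    assumption, since the text specifies only the maximum count.
    """
    return sorted(points, key=lambda p: p[2], reverse=True)[:k]


if __name__ == "__main__":
    detections = [(10, 12, 0.1), (40, 8, 0.9), (25, 30, 0.5)]
    print(keep_top_k(detections, k=2))
```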
