
arxiv: 1907.11375 · v1 · pith:BDUU2CXYnew · submitted 2019-07-26 · 💻 cs.CV

Unsupervised Learning Framework of Interest Point Via Properties Optimization

Pith reviewed 2026-05-24 16:09 UTC · model grok-4.3

classification 💻 cs.CV
keywords unsupervised learning · interest point detection · feature descriptor · expectation maximization · image matching · property optimization · fully convolutional network

The pith

An unsupervised framework jointly trains interest point detectors and descriptors by optimizing sparsity, repeatability and discriminability as joint probabilities via EM.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tries to establish that interest points can be learned entirely without labels by defining their key properties as probabilities and maximizing their joint distribution with a latent-variable EM procedure. A sympathetic reader would care because supervised training of detectors requires expensive point annotations that often fail to transfer across domains, while this approach claims to produce a single network that works on diverse scenes out of the box. The method instantiates both the detector and descriptor inside a fully convolutional Property Network and shows it exceeds prior art on standard matching benchmarks. The optimization uses a mini-batch approximation of EM to keep computation feasible on large image collections.

Core claim

By treating sparsity, repeatability and discriminability as a joint probability over extracted points and introducing a latent variable for the probability that any given point satisfies the required properties, the training objective can be maximized with the EM algorithm; the resulting Property Network, implemented as fully convolutional networks, produces detectors and descriptors that outperform state-of-the-art methods on multiple image matching benchmarks without any retraining or labeled data.

What carries the argument

Joint probability distribution over sparsity, repeatability and discriminability, maximized by latent-variable Expectation Maximization with mini-batch approximation.
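
For orientation, the textbook form of latent-variable EM is sketched below. This is the generic algorithm, not the paper's specific objective: the symbols x (observations), z (latent variable), θ (parameters) and q (a distribution over z) are illustrative assumptions and do not follow the paper's notation.

```latex
% Generic latent-variable EM; x, z, \theta and q are illustrative symbols,
% not the paper's notation.
\begin{align}
\log p(x \mid \theta)
  &\ge \mathbb{E}_{q(z)}\!\left[\log p(x, z \mid \theta)\right]
     - \mathbb{E}_{q(z)}\!\left[\log q(z)\right]
  && \text{(lower bound, valid for any } q\text{)} \\
q^{(t)}(z) &= p\!\left(z \mid x, \theta^{(t)}\right)
  && \text{(E-step)} \\
\theta^{(t+1)} &= \arg\max_{\theta}\;
  \mathbb{E}_{q^{(t)}(z)}\!\left[\log p(x, z \mid \theta)\right]
  && \text{(M-step)}
\end{align}
```

A mini-batch variant typically evaluates both steps on a random subset of the data at each iteration instead of the full collection. Whether the paper's MBEM takes exactly this form is not stated in the abstract.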

If this is right

  • Detector and descriptor can be learned jointly in a single unsupervised pass without ground-truth correspondences.
  • The same network generalizes to new scene types without retraining or fine-tuning.
  • Mini-batch EM makes the probabilistic objective tractable for large unlabeled image collections.
  • The framework can instantiate different network architectures while keeping the same property-based objective.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If the probabilistic formulation holds, explicit supervision may be unnecessary once the right properties are encoded as probabilities rather than as hand-labeled points.
  • The approach could be tested on tasks beyond rigid matching, such as non-rigid deformation or low-texture environments, to see whether the same three properties suffice.
  • Replacing the fully convolutional backbone with other architectures would test whether the gains come mainly from the objective or from the network choice.

Load-bearing premise

That the three properties can be expressed as joint probabilities whose maximization through EM produces points that reliably correspond across real image transformations.

What would settle it

Evaluating the trained Property Network on HPatches or a similar benchmark and finding its matching or repeatability scores fall below those of a standard supervised baseline such as SIFT or SuperPoint.
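
A check of this kind could be scripted roughly as follows. This is a minimal sketch of a simplified matching score, assuming mutual nearest-neighbour descriptor matching and a known ground-truth homography between the two images. The function names, the 5-pixel default threshold and the normalisation by the number of points in the first image are assumptions made for illustration; the benchmark protocol additionally restricts points to the shared view and may average over both directions.

```python
import numpy as np


def warp_points(points, H):
    """Apply a 3x3 homography H to an (N, 2) array of (x, y) points."""
    homog = np.hstack([points, np.ones((len(points), 1))])
    warped = homog @ H.T
    return warped[:, :2] / warped[:, 2:3]


def matching_score(kp_a, desc_a, kp_b, desc_b, H_ab, eps=5.0):
    """Fraction of points in image A that are matched correctly in image B.

    Simplified, hypothetical variant: a match counts as correct when it is a
    mutual nearest neighbour in descriptor space and the warped point from A
    lands within eps pixels of its match in B.
    """
    # Pairwise L2 distances between descriptors, shape (N_a, N_b).
    dists = np.linalg.norm(desc_a[:, None, :] - desc_b[None, :, :], axis=2)
    nn_ab = dists.argmin(axis=1)  # best match in B for each point of A
    nn_ba = dists.argmin(axis=0)  # best match in A for each point of B
    mutual = nn_ba[nn_ab] == np.arange(len(kp_a))

    # Reprojection error of each tentative match under the homography.
    err = np.linalg.norm(warp_points(kp_a, H_ab) - kp_b[nn_ab], axis=1)
    correct = mutual & (err <= eps)
    return correct.sum() / max(len(kp_a), 1)
```

Running the same function over every image pair for each method, with identical point budgets and thresholds, gives the like-for-like comparison that the claim depends on.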

Figures

Figures reproduced from arXiv: 1907.11375 by Cai Wen, Pei Yan, Yihua Tan, Yuan Tai, Yuan Xiao.

Figure 1: Overview of unsupervised training framework via optimizing the properties of interest point. It consists of three parts: (1) training ...
Figure 2: Visual matching results of state-of-the-art algorithms and our PN-i-64 model (M-score indicates Matching Score, and Homo-error ...
Figure 3: Visual matching results of Superpoint and different Property Networks. First col: PN-i-64 model, second col: PN-v-64 model, ...
Figure 4: Overview of Property Network. Panel labels: Original Image, Homography, Blur, Channels Shuffle, Contrast Normalization, Grayscale, Invert, Salt and Pepper, Shadow.
Figure 5: Examples of simulations for illumination and viewpoint
Figure 6: Visual matching results of state-of-the-art algorithms and our PN-i-64 model (M-score indicates Matching Score, and Homo-error ...
Original abstract

This paper presents an entirely unsupervised interest point training framework by jointly learning detector and descriptor, which takes an image as input and outputs a probability and a description for every image point. The objective of the training framework is formulated as joint probability distribution of the properties of the extracted points. The essential properties are selected as sparsity, repeatability and discriminability which are formulated by the probabilities. To maximize the objective efficiently, latent variable is introduced to represent the probability of that a point satisfies the required properties. Therefore, original maximization can be optimized with Expectation Maximization algorithm (EM). Considering high computation cost of EM on large scale image set, we implement the optimization process with an efficient strategy as Mini-Batch approximation of EM (MBEM). In the experiments both detector and descriptor are instantiated with fully convolutional network which is named as Property Network (PN). The experiments demonstrate that PN outperforms state-of-the-art methods on a number of image matching benchmarks without need of retraining. PN also reveals that the proposed training framework has high flexibility to adapt to diverse types of scenes.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. This paper proposes an entirely unsupervised framework for jointly learning interest point detectors and descriptors. An image is input to a network that outputs a probability and descriptor for every point; the training objective is defined as the joint probability distribution over three properties (sparsity, repeatability, discriminability) expressed as probabilities. A latent variable is introduced to represent whether a point satisfies the properties, allowing optimization via the EM algorithm, which is approximated by mini-batch EM (MBEM) for scalability. Both components are realized as a fully convolutional Property Network (PN) that is reported to outperform prior methods on image-matching benchmarks without retraining and to adapt to diverse scenes.

Significance. If the probability formulations and latent-variable EM construction can be shown to produce detectors and descriptors that genuinely encode repeatability and discriminability without supervision or known correspondences, the result would be significant: it would remove the requirement for labeled data or synthetic warps that currently limit supervised interest-point methods and could improve cross-scene generalization.

major comments (2)
  1. [Abstract] The central claim rests on the definitions of the three properties as joint probabilities and on the latent-variable EM step (via MBEM) actually enforcing repeatability and discriminability. The abstract supplies no equations for these probabilities, so it is impossible to verify whether the objective admits trivial solutions (e.g., uniform or edge-only selections) or whether the unsupervised proxy truly substitutes for cross-image correspondences.
  2. [Abstract] The performance claim that PN outperforms state-of-the-art methods on multiple benchmarks without retraining is load-bearing for the contribution. No quantitative results, benchmark names, or comparison tables are supplied, preventing assessment of whether the reported gains are statistically meaningful or whether the baselines were evaluated under identical conditions.
minor comments (1)
  1. The description of the fully convolutional architecture for simultaneous detection and description could be expanded with layer counts, receptive-field sizes, and how the probability and descriptor heads share features.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, with proposed revisions to strengthen the abstract while preserving its conciseness.

  1. Referee: [Abstract] The central claim rests on the definitions of the three properties as joint probabilities and on the latent-variable EM step (via MBEM) actually enforcing repeatability and discriminability. The abstract supplies no equations for these probabilities, so it is impossible to verify whether the objective admits trivial solutions (e.g., uniform or edge-only selections) or whether the unsupervised proxy truly substitutes for cross-image correspondences.

    Authors: The abstract is intentionally concise. The probability formulations for sparsity (modeled as a Bernoulli process encouraging sparse selection), repeatability (probability of consistent detection under transformations), and discriminability (probability of unique descriptors), along with the latent variable z indicating property satisfaction and the MBEM optimization, are fully derived in Sections 3.1–3.3. The joint objective is constructed to preclude trivial solutions because the discriminability term penalizes descriptor collisions across points, and the EM procedure alternates between inferring latent assignments and maximizing the expected joint probability. We will revise the abstract to explicitly name the three properties and reference the EM-based optimization to better convey the unsupervised proxy for correspondences. revision: partial

  2. Referee: [Abstract] The performance claim that PN outperforms state-of-the-art methods on multiple benchmarks without retraining is load-bearing for the contribution. No quantitative results, benchmark names, or comparison tables are supplied, preventing assessment of whether the reported gains are statistically meaningful or whether the baselines were evaluated under identical conditions.

    Authors: We agree that greater specificity would strengthen the abstract. Section 4 presents quantitative results with tables comparing PN against SuperPoint, LIFT, and other baselines on HPatches, Oxford Affine, and additional matching benchmarks, using identical evaluation protocols (e.g., matching score at 5-pixel threshold) and reporting mean improvements without any retraining on the test sets. We will revise the abstract to name the primary benchmarks and indicate that detailed tables appear in the experiments. revision: yes

Circularity Check

0 steps flagged

No circularity: objective defined from external properties and optimized via standard EM

full rationale

The derivation defines the training objective externally as the joint probability over three standard interest-point properties (sparsity, repeatability, discriminability) and applies latent-variable EM (MBEM) to maximize it. No equation or step reduces the target to a fitted parameter or self-citation by construction; the unsupervised proxy is motivated independently of the final benchmark performance, and the network instantiation is a conventional FCN. The central claim therefore rests on the empirical transfer of the learned detector/descriptor rather than on any definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The framework rests on the domain assumption that the three listed properties can be expressed as probabilities suitable for EM; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Interest points possess the essential properties of sparsity, repeatability and discriminability which can be formulated by probabilities.
    Directly stated in the abstract as the basis for the objective.

pith-pipeline@v0.9.0 · 5716 in / 1180 out tokens · 24710 ms · 2026-05-24T16:09:20.123625+00:00 · methodology


Reference graph

Works this paper leans on

41 extracted references · 41 canonical work pages · 2 internal anchors

  1. [1]

    Freak: Fast retina keypoint

    Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak: Fast retina keypoint. In IEEE Conference on Computer Vision & Pattern Recognition, pages 510–517, 2012

  2. [2]

    Kaze features

    Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J Davison. Kaze features. In European Conference on Computer Vision, pages 214–227. Springer, 2012

  3. [3]

    Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors

    Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017

  4. [4]

    Surf: Speeded up robust features

    Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006

  5. [5]

    Brief: Binary robust independent elementary features

    Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010

  6. [6]

    Matching features without descriptors: Implicitly matched interest points (imips)

    Titus Cieslewski, Michael Bloesch, and Davide Scaramuzza. Matching features without descriptors: Implicitly matched interest points (imips). arXiv preprint arXiv:1811.10681, 2018

  7. [7]

    Sips: unsupervised succinct interest points

    Titus Cieslewski and Davide Scaramuzza. Sips: unsupervised succinct interest points. arXiv preprint arXiv:1805.01358, 2018

  8. [8]

    Group equivariant convolutional networks

    Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016

  9. [9]

    Superpoint: Self-supervised interest point detection and description

    Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018

  10. [10]

    A combined corner and edge detector

    Christopher G Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988

  11. [11]

    Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

    Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

  12. [12]

    Adam: A Method for Stochastic Optimization

    Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014

  13. [13]

    Brisk: Binary robust invariant scalable keypoints

    Stefan Leutenegger, Margarita Chli, and Roland Siegwart. Brisk: Binary robust invariant scalable keypoints. In 2011 IEEE international conference on computer vision (ICCV), pages 2548–2555. IEEE, 2011

  14. [14]

    Microsoft coco: Common objects in context

    Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014

  15. [15]

    Fully convolutional networks for semantic segmentation

    Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015

  16. [16]

    Distinctive image features from scale-invariant keypoints

    David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004

  17. [17]

    A performance evaluation of local descriptors

    Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE transactions on pattern analysis and machine intelligence, 27(10):1615–1630, 2005

  18. [18]

    A comparison of affine region detectors

    Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A comparison of affine region detectors. International journal of computer vision, 65(1-2):43–72, 2005

  19. [19]

    Lf-net: learning local features from images

    Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: learning local features from images. In Advances in Neural Information Processing Systems, pages 6237–6247, 2018

  20. [20]

    Machine learning for high-speed corner detection

    Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European conference on computer vision, pages 430–443. Springer, 2006

  21. [21]

    Quad-networks: unsupervised learning to rank for interest point detection

    Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1822–1830, 2017

  22. [22]

    Evaluation of interest point detectors

    Cordelia Schmid, Roger Mohr, and Christian Bauckhage. Evaluation of interest point detectors. International Journal of computer vision, 37(2):151–172, 2000

  23. [23]

    Training for task specific keypoint detection

    Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua. Training for task specific keypoint detection. In Joint Pattern Recognition Symposium, pages 151–160. Springer, 2009

  24. [24]

    Tilde: a temporally invariant learned detector

    Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. Tilde: a temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5279–5288, 2015

  25. [25]

    Harmonic networks: Deep translation and rotation equivariance

    Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In IEEE Conference on Computer Vision & Pattern Recognition, pages 5028–5037, 2017

  26. [26]

    Lift: Learned invariant feature transform

    Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016

  27. [27]

    Deconvolutional networks

    Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. 2010.

    Supplementary Material. In this supplementary material we give more details of our implementation and experimental results. In section 1 we implement the Expectation Maximization (EM) process with an efficient strategy called Mini-Batch approximation of EM (M...

  28. [28]

    Mini-Batch Approximation of Expectation Maximization. 1.1. Problems of Original Expectation Maximization. Before discussing the difficulties of optimizing our objective with the original Expectation Maximization algorithm (EM), we first review the objective of our training framework, which is formulated as { arg max θF,θD,y1,.,yK P (θF,θD,y 1,.,y K) =∏ kr ( yk) · ˆc (...

  29. [29]

    It is easy to check whether a specific y(m) satisfies the sparsity constraint, but it is difficult to directly obtain every y(m) that satisfies it

    The definition of the sample space Y incorporates the sparsity constraint. It is easy to check whether a specific y(m) satisfies the sparsity constraint, but it is difficult to directly obtain every y(m) that satisfies it

  30. [30]

    Since (41) is hard to solve exactly, this subsection introduces an efficient strategy to approximate pi, which comprises three steps

    To solve for the distribution pi, we must consider all points because ĉi depends on the vector y. Since (41) is hard to solve exactly, this subsection introduces an efficient strategy to approximate pi, which comprises three steps. First, we approximate the sample space Y with Ŷ, which can be obtained efficiently. Second, ĉi(y) is approximated with ĉi(ŷ) wh...

  31. [31]

    In PN, both detector and descriptor are implemented with Fully Convolutional Network [15, 27], whose architecture is inspired by [9]

    Architecture of Property Network. Property Network (PN) is a specific implementation of our training framework. In PN, both detector and descriptor are implemented with Fully Convolutional Network [15, 27], whose architecture is inspired by [9]. Figure 4 gives a brief overview of the architecture of PN. The detector comprises an encoder and a detection decoder, and descript...

  32. [32]

    For illumination simulation we randomly select transformations that change pixel values

    Simulations for Viewpoint and Illumination Changes. In this section we outline our simulations for illumination and viewpoint changes. For illumination simulation we randomly select transformations that change pixel values. For viewpoint simulation we randomly generate homography matrices, which are used to perform homography transformations. Figure 5 shows sever...

  33. [33]

    Apply Gaussian blur, average blur or median blur to the image

    Image Blur. Apply Gaussian blur, average blur or median blur to the image

  34. [34]

    Permute the order of the color channels of the image

    Channels Shuffle. Permute the order of the color channels of the image

  35. [35]

    Change the contrast of images by moving pixel values away from or closer to 128

    Contrast Normalization. Change the contrast of images by moving pixel values away from or closer to 128

  36. [36]

    Convert images to grayscale and mix them with the original image using a random weight

    Grayscale. Convert images to grayscale and mix them with the original image using a random weight

  37. [37]

    set them to 255 − original pixel value

    Invert all pixels in the given image, i.e. set them to 255 − original pixel value

  38. [38]

    Randomly replace some pixels with very white or black colors

    Salt and Pepper noise. Randomly replace some pixels with very white or black colors

  39. [39]

    Randomly insert some dark shapes into the image

    Shadow. Randomly insert some dark shapes into the image. In the main text we mentioned that different PN models are trained with different levels of simulation. PN-v-64 is trained with only Image Blur, Contrast Normalization and Shadow, and PN-i-64 and PN-128 are trained with all kinds of simulations for illumination changes. In simulations for viewpoint changes, ...

  40. [40]

    In this section we introduce their definitions

    Performance Metrics. The two metrics used in our experiments are Matching Score and Homography Estimation, which are identical to those used in [9]. In this section we introduce their definitions. 4.1. Matching Score. Matching Score measures the overall performance of the interest point detector and descriptor. It measures the ratio of ground truth correspondences t...

  41. [41]

    In this subsection we give results of different methods on image size 320×240

    More Experimental Results. In the main text we demonstrated results on image size 640×480. In this subsection we give results of different methods on image size 320×240. All experiment configurations are the same as those for image size 640×480 except the maximum number of extracted interest points. We keep no more than 300 interest points for 320×240 images....
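
Several of the illumination simulations described in the supplementary fragments above (channel shuffle, contrast normalization around 128, grayscale mixing, inversion, salt-and-pepper noise) are simple enough to sketch in NumPy. The following is an illustrative sketch only, not the authors' implementation: the function names, parameter names and the use of a plain channel mean for grayscale are assumptions, and blur, shadow and homography warping are omitted.

```python
import numpy as np


def channel_shuffle(img, rng):
    """Permute the order of the color channels of an (H, W, C) uint8 image."""
    return img[..., rng.permutation(img.shape[-1])]


def contrast_normalization(img, alpha):
    """Move pixel values away from (alpha > 1) or closer to (alpha < 1) 128."""
    out = 128.0 + alpha * (img.astype(np.float32) - 128.0)
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def grayscale_mix(img, weight):
    """Blend the image with its grayscale version; weight=1 is fully gray.

    Grayscale is taken as the plain channel mean, which is an assumption.
    """
    pixels = img.astype(np.float32)
    gray = pixels.mean(axis=-1, keepdims=True)
    out = weight * gray + (1.0 - weight) * pixels
    return np.clip(np.rint(out), 0, 255).astype(np.uint8)


def invert(img):
    """Set every pixel to 255 minus its original value."""
    return 255 - img


def salt_and_pepper(img, fraction, rng):
    """Replace roughly `fraction` of the pixels with pure white or black."""
    out = img.copy()
    hit = rng.random(img.shape[:2]) < fraction
    salt = rng.random(img.shape[:2]) < 0.5
    out[hit & salt] = 255   # white pixels
    out[hit & ~salt] = 0    # black pixels
    return out
```

A training pipeline would typically draw a random subset of these per image with a seeded generator such as `np.random.default_rng(0)`, so that the same photometric change can be reproduced when building image pairs.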