Unsupervised Learning Framework of Interest Point Via Properties Optimization
Pith reviewed 2026-05-24 16:09 UTC · model grok-4.3
The pith
An unsupervised framework jointly trains interest point detectors and descriptors by optimizing sparsity, repeatability and discriminability as joint probabilities via EM.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By treating sparsity, repeatability and discriminability as a joint probability over extracted points and introducing a latent variable for the probability that any given point satisfies the required properties, the training objective can be maximized with the EM algorithm; the resulting Property Network, implemented as fully convolutional networks, produces detectors and descriptors that outperform state-of-the-art methods on multiple image matching benchmarks without any retraining or labeled data.
What carries the argument
Joint probability distribution over sparsity, repeatability and discriminability, maximized by latent-variable Expectation Maximization with mini-batch approximation.
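For orientation, the generic latent-variable EM iteration that an objective of this kind relies on can be written as below. This is the textbook form, not the paper's specific equations; the symbols $x$ (observed images), $z$ (latent indicator of whether a point satisfies the properties) and $\theta$ (detector and descriptor parameters) are assumptions chosen for illustration.

```latex
\begin{aligned}
\text{E-step:}\quad & q^{(t)}(z) = p\big(z \mid x, \theta^{(t)}\big) \\
\text{M-step:}\quad & \theta^{(t+1)} = \arg\max_{\theta}\; \mathbb{E}_{q^{(t)}(z)}\big[\log p(x, z \mid \theta)\big]
\end{aligned}
```

A mini-batch approximation would evaluate both steps on a subset of the data at each iteration rather than on the full image set.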
If this is right
- Detector and descriptor can be learned jointly in a single unsupervised pass without ground-truth correspondences.
- The same network generalizes to new scene types without retraining or fine-tuning.
- Mini-batch EM makes the probabilistic objective tractable for large unlabeled image collections.
- The framework can instantiate different network architectures while keeping the same property-based objective.
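The mini-batch idea in the list above can be sketched on a toy problem. The example below is not the paper's MBEM: it swaps in a stepwise EM with running sufficient statistics on a two-component, unit-variance 1D Gaussian mixture, purely to show how an E-step and M-step can run on small batches. The function names, the step-size schedule and the toy model are all assumptions.

```python
import math
import random


def posterior_z(x, pi, mu0, mu1):
    """E-step for one sample: P(z = 1 | x) for a two-component unit-variance mixture."""
    p0 = (1.0 - pi) * math.exp(-0.5 * (x - mu0) ** 2)
    p1 = pi * math.exp(-0.5 * (x - mu1) ** 2)
    return p1 / (p0 + p1)


def minibatch_em(data, batch_size=32, epochs=20, seed=0):
    """Stepwise (mini-batch) EM; a toy stand-in, not the paper's MBEM."""
    rng = random.Random(seed)
    pi, mu0, mu1 = 0.5, min(data), max(data)
    # Running averages of the sufficient statistics E[z], E[z*x], E[(1-z)*x].
    s_z, s_zx, s_nx = pi, pi * mu1, (1.0 - pi) * mu0
    step = 0
    for _ in range(epochs):
        order = list(data)
        rng.shuffle(order)
        for start in range(0, len(order), batch_size):
            batch = order[start:start + batch_size]
            # E-step on the current batch only.
            r = [posterior_z(x, pi, mu0, mu1) for x in batch]
            n = len(batch)
            b_z = sum(r) / n
            b_zx = sum(ri * x for ri, x in zip(r, batch)) / n
            b_nx = sum((1.0 - ri) * x for ri, x in zip(r, batch)) / n
            # Blend batch statistics into the running ones (hypothetical schedule).
            step += 1
            eta = (step + 2) ** -0.7
            s_z = (1.0 - eta) * s_z + eta * b_z
            s_zx = (1.0 - eta) * s_zx + eta * b_zx
            s_nx = (1.0 - eta) * s_nx + eta * b_nx
            # M-step: closed-form update from the running statistics.
            pi = s_z
            mu1 = s_zx / s_z
            mu0 = s_nx / (1.0 - s_z)
    return pi, mu0, mu1
```

Each update touches only `batch_size` samples, so the cost per step does not grow with the size of the collection; this is the property the bullet on tractability refers to.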
Where Pith is reading between the lines
- If the probabilistic formulation holds, explicit supervision may be unnecessary once the right properties are encoded as probabilities rather than as hand-labeled points.
- The approach could be tested on tasks beyond rigid matching, such as non-rigid deformation or low-texture environments, to see whether the same three properties suffice.
- Replacing the fully convolutional backbone with other architectures would test whether the gains come mainly from the objective or from the network choice.
Load-bearing premise
That the three properties can be expressed as joint probabilities whose maximization through EM produces points that reliably correspond across real image transformations.
What would settle it
Evaluating the trained Property Network on HPatches or a similar benchmark and finding that its matching or repeatability scores fall below those of a standard baseline such as SIFT (handcrafted) or SuperPoint (self-supervised).
Original abstract
This paper presents an entirely unsupervised interest point training framework by jointly learning detector and descriptor, which takes an image as input and outputs a probability and a description for every image point. The objective of the training framework is formulated as joint probability distribution of the properties of the extracted points. The essential properties are selected as sparsity, repeatability and discriminability which are formulated by the probabilities. To maximize the objective efficiently, latent variable is introduced to represent the probability of that a point satisfies the required properties. Therefore, original maximization can be optimized with Expectation Maximization algorithm (EM). Considering high computation cost of EM on large scale image set, we implement the optimization process with an efficient strategy as Mini-Batch approximation of EM (MBEM). In the experiments both detector and descriptor are instantiated with fully convolutional network which is named as Property Network (PN). The experiments demonstrate that PN outperforms state-of-the-art methods on a number of image matching benchmarks without need of retraining. PN also reveals that the proposed training framework has high flexibility to adapt to diverse types of scenes.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. This paper proposes an entirely unsupervised framework for jointly learning interest point detectors and descriptors. An image is input to a network that outputs a probability and descriptor for every point; the training objective is defined as the joint probability distribution over three properties (sparsity, repeatability, discriminability) expressed as probabilities. A latent variable is introduced to represent whether a point satisfies the properties, allowing optimization via the EM algorithm, which is approximated by mini-batch EM (MBEM) for scalability. Both components are realized as a fully convolutional Property Network (PN) that is reported to outperform prior methods on image-matching benchmarks without retraining and to adapt to diverse scenes.
Significance. If the probability formulations and latent-variable EM construction can be shown to produce detectors and descriptors that genuinely encode repeatability and discriminability without supervision or known correspondences, the result would be significant: it would remove the requirement for labeled data or synthetic warps that currently limit supervised interest-point methods and could improve cross-scene generalization.
major comments (2)
- [Abstract] The central claim rests on the definitions of the three properties as joint probabilities and on the latent-variable EM step (via MBEM) actually enforcing repeatability and discriminability. The abstract supplies no equations for these probabilities, so it is impossible to verify whether the objective admits trivial solutions (e.g., uniform or edge-only selections) or whether the unsupervised proxy truly substitutes for cross-image correspondences.
- [Abstract] The performance claim that PN outperforms state-of-the-art methods on multiple benchmarks without retraining is load-bearing for the contribution. No quantitative results, benchmark names, or comparison tables are supplied, preventing assessment of whether the reported gains are statistically meaningful or whether the baselines were evaluated under identical conditions.
minor comments (1)
- The description of the fully convolutional architecture for simultaneous detection and description could be expanded with layer counts, receptive-field sizes, and how the probability and descriptor heads share features.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment point by point below, with proposed revisions to strengthen the abstract while preserving its conciseness.
-
Referee: [Abstract] The central claim rests on the definitions of the three properties as joint probabilities and on the latent-variable EM step (via MBEM) actually enforcing repeatability and discriminability. The abstract supplies no equations for these probabilities, so it is impossible to verify whether the objective admits trivial solutions (e.g., uniform or edge-only selections) or whether the unsupervised proxy truly substitutes for cross-image correspondences.
Authors: The abstract is intentionally concise. The probability formulations for sparsity (modeled as a Bernoulli process encouraging sparse selection), repeatability (probability of consistent detection under transformations), and discriminability (probability of unique descriptors), along with the latent variable z indicating property satisfaction and the MBEM optimization, are fully derived in Sections 3.1–3.3. The joint objective is constructed to preclude trivial solutions because the discriminability term penalizes descriptor collisions across points, and the EM procedure alternates between inferring latent assignments and maximizing the expected joint probability. We will revise the abstract to explicitly name the three properties and reference the EM-based optimization to better convey the unsupervised proxy for correspondences. revision: partial
-
Referee: [Abstract] The performance claim that PN outperforms state-of-the-art methods on multiple benchmarks without retraining is load-bearing for the contribution. No quantitative results, benchmark names, or comparison tables are supplied, preventing assessment of whether the reported gains are statistically meaningful or whether the baselines were evaluated under identical conditions.
Authors: We agree that greater specificity would strengthen the abstract. Section 4 presents quantitative results with tables comparing PN against SuperPoint, LIFT, and other baselines on HPatches, Oxford Affine, and additional matching benchmarks, using identical evaluation protocols (e.g., matching score at 5-pixel threshold) and reporting mean improvements without any retraining on the test sets. We will revise the abstract to name the primary benchmarks and indicate that detailed tables appear in the experiments. revision: yes
Circularity Check
No circularity: objective defined from external properties and optimized via standard EM
The derivation defines the training objective externally as the joint probability over three standard interest-point properties (sparsity, repeatability, discriminability) and applies latent-variable EM (MBEM) to maximize it. No equation or step reduces the target to a fitted parameter or self-citation by construction; the unsupervised proxy is motivated independently of the final benchmark performance, and the network instantiation is a conventional FCN. The central claim therefore rests on the empirical transfer of the learned detector/descriptor rather than on any definitional equivalence.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Interest points possess the essential properties of sparsity, repeatability and discriminability, each of which can be formulated as a probability.
Reference graph
Works this paper leans on
- [1] Alexandre Alahi, Raphael Ortiz, and Pierre Vandergheynst. Freak: Fast retina keypoint. In IEEE Conference on Computer Vision & Pattern Recognition, pages 510–517, 2012.
- [2] Pablo Fernández Alcantarilla, Adrien Bartoli, and Andrew J Davison. Kaze features. In European Conference on Computer Vision, pages 214–227. Springer, 2012.
- [3] Vassileios Balntas, Karel Lenc, Andrea Vedaldi, and Krystian Mikolajczyk. Hpatches: A benchmark and evaluation of handcrafted and learned local descriptors. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5173–5182, 2017.
- [4] Herbert Bay, Tinne Tuytelaars, and Luc Van Gool. Surf: Speeded up robust features. In European conference on computer vision, pages 404–417. Springer, 2006.
- [5] Michael Calonder, Vincent Lepetit, Christoph Strecha, and Pascal Fua. Brief: Binary robust independent elementary features. In European conference on computer vision, pages 778–792. Springer, 2010.
- [6] Titus Cieslewski, Michael Bloesch, and Davide Scaramuzza. Matching features without descriptors: Implicitly matched interest points (imips). arXiv preprint arXiv:1811.10681, 2018.
- [7] Titus Cieslewski and Davide Scaramuzza. Sips: unsupervised succinct interest points. arXiv preprint arXiv:1805.01358, 2018.
- [8] Taco Cohen and Max Welling. Group equivariant convolutional networks. In International conference on machine learning, pages 2990–2999, 2016.
- [9] Daniel DeTone, Tomasz Malisiewicz, and Andrew Rabinovich. Superpoint: Self-supervised interest point detection and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition Workshops, pages 224–236, 2018.
- [10] Christopher G Harris, Mike Stephens, et al. A combined corner and edge detector. In Alvey vision conference, volume 15, pages 10–5244. Citeseer, 1988.
- [11] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
- [12] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.
- [13] Stefan Leutenegger, Margarita Chli, and Roland Siegwart. Brisk: Binary robust invariant scalable keypoints. In 2011 IEEE international conference on computer vision (ICCV), pages 2548–2555. IEEE, 2011.
- [14] Tsung-Yi Lin, Michael Maire, Serge Belongie, James Hays, Pietro Perona, Deva Ramanan, Piotr Dollár, and C Lawrence Zitnick. Microsoft coco: Common objects in context. In European conference on computer vision, pages 740–755. Springer, 2014.
- [15] Jonathan Long, Evan Shelhamer, and Trevor Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3431–3440, 2015.
- [16] David G Lowe. Distinctive image features from scale-invariant keypoints. International journal of computer vision, 60(2):91–110, 2004.
- [17] Krystian Mikolajczyk and Cordelia Schmid. A performance evaluation of local descriptors. IEEE transactions on pattern analysis and machine intelligence, 27(10):1615–1630, 2005.
- [18] Krystian Mikolajczyk, Tinne Tuytelaars, Cordelia Schmid, Andrew Zisserman, Jiri Matas, Frederik Schaffalitzky, Timor Kadir, and Luc Van Gool. A comparison of affine region detectors. International journal of computer vision, 65(1-2):43–72, 2005.
- [19] Yuki Ono, Eduard Trulls, Pascal Fua, and Kwang Moo Yi. Lf-net: learning local features from images. In Advances in Neural Information Processing Systems, pages 6237–6247, 2018.
- [20] Edward Rosten and Tom Drummond. Machine learning for high-speed corner detection. In European conference on computer vision, pages 430–443. Springer, 2006.
- [21] Nikolay Savinov, Akihito Seki, Lubor Ladicky, Torsten Sattler, and Marc Pollefeys. Quad-networks: unsupervised learning to rank for interest point detection. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 1822–1830, 2017.
- [22] Cordelia Schmid, Roger Mohr, and Christian Bauckhage. Evaluation of interest point detectors. International Journal of computer vision, 37(2):151–172, 2000.
- [23] Christoph Strecha, Albrecht Lindner, Karim Ali, and Pascal Fua. Training for task specific keypoint detection. In Joint Pattern Recognition Symposium, pages 151–160. Springer, 2009.
- [24] Yannick Verdie, Kwang Yi, Pascal Fua, and Vincent Lepetit. Tilde: a temporally invariant learned detector. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5279–5288, 2015.
- [25] Daniel E. Worrall, Stephan J. Garbin, Daniyar Turmukhambetov, and Gabriel J. Brostow. Harmonic networks: Deep translation and rotation equivariance. In IEEE Conference on Computer Vision & Pattern Recognition, pages 5028–5037, 2017.
- [26] Kwang Moo Yi, Eduard Trulls, Vincent Lepetit, and Pascal Fua. Lift: Learned invariant feature transform. In European Conference on Computer Vision, pages 467–483. Springer, 2016.
[27]
Matthew D Zeiler, Dilip Krishnan, Graham W Taylor, and Rob Fergus. Deconvolutional networks. 2010. Supplementary Material In this supplementary material we give more details of our implementation and experimential results. In section 1 we implement the Expectation Maximization (EM) process with an efficient strategy called Mini-Batch approximation of EM (M...
work page 2010
- [28] Mini-Batch Approximation of Expectation Maximization 1.1. Problems of Original Expectation Maximization Before discussing the difficulties of optimizing our objective with the original Expectation Maximization algorithm (EM), we first review the objective of our training framework, which is formulated as $\arg\max_{\theta_F,\theta_D,y_1,\ldots,y_K} P(\theta_F,\theta_D,y_1,\ldots,y_K) = \prod_k r(y_k) \cdot \hat{c}($...
- [29] The definition of the sample space $Y$ incorporates the sparsity constraint. It is easy to check whether a specific $y^{(m)}$ satisfies the sparsity constraint, but it is difficult to obtain directly all $y^{(m)}$ that satisfy it.
- [30] To solve for the distribution $p_i$, we must consider all points because $\hat{c}_i$ depends on the vector $y$. Since it is hard to solve (41) precisely, this subsection introduces an efficient strategy to approximate $p_i$, which comprises three steps. First, we approximate the sample space $Y$ with $\hat{Y}$, which can be obtained efficiently. Second, $\hat{c}_i(y)$ is approximated with $\hat{c}_i(\hat{y})$ wh...
- [31] Architecture of Property Network Property Network (PN) is a specific implementation of our training framework. In PN, both the detector and the descriptor are implemented as Fully Convolutional Networks [15, 27], whose architecture is inspired by [9]. Figure 4 gives a brief overview of the architecture of PN. The detector comprises an encoder and a detection decoder, and descript...
- [32] Simulations for Viewpoint and Illumination Changes In this section we outline our simulations for illumination and viewpoint changes. For illumination simulation we randomly select transformations that change pixel values. For viewpoint simulation we randomly generate homography matrices and use them to apply homography transformations. Figure 5 shows sever...
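The viewpoint simulation described above can be illustrated with a minimal sketch that samples a homography near the identity and maps a point through it. The sampling scheme, the parameter ranges and the function names are hypothetical; the source does not state how its homography matrices are generated.

```python
import random


def random_homography(rng, max_affine=0.1, max_shift=10.0, max_persp=1e-4):
    """Sample a 3x3 homography close to the identity (all ranges are assumptions)."""
    def u(m):
        return rng.uniform(-m, m)
    return [
        [1.0 + u(max_affine), u(max_affine), u(max_shift)],
        [u(max_affine), 1.0 + u(max_affine), u(max_shift)],
        [u(max_persp), u(max_persp), 1.0],
    ]


def warp_point(H, x, y):
    """Map (x, y) through H in homogeneous coordinates, then divide by w."""
    xh = H[0][0] * x + H[0][1] * y + H[0][2]
    yh = H[1][0] * x + H[1][1] * y + H[1][2]
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return xh / w, yh / w
```

Keeping the perspective terms small keeps the denominator `w` positive over a 640 × 480 image, so the warp stays well defined for every pixel.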
- [33] Image Blur. Apply Gaussian blur, average blur or median blur to the image.
- [34] Channels Shuffle. Permute the order of the color channels of the image.
- [35] Contrast Normalization. Change the contrast of the image by moving pixel values away from or closer to 128.
- [36] Grayscale. Convert the image to grayscale and mix it with the original image using a random weight.
- [37] Invert all pixels in the given image, i.e. set them to 255 − original pixel value.
- [38] Salt and Pepper noise. Randomly replace some pixels with very white or black colors.
- [39] Shadow. Randomly insert some dark shapes into the image.
In the main text we mentioned that different PN models are trained with different levels of simulation. PN-v-64 is trained with only Image Blur, Contrast Normalization and Shadow, while PN-i-64 and PN-128 are trained with all kinds of simulations for illumination changes. In simulations for viewpoint changes, ...
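Several of the illumination transformations listed above are simple per-pixel operations, sketched here on images stored as nested lists of 8-bit values. The exact formulas are assumptions: the contrast rule `128 + alpha * (v - 128)`, the Rec. 601 grayscale weights, and the use of pure 0 or 255 for salt-and-pepper noise are illustrative choices, not details given in the source.

```python
import random


def clip(v):
    """Round to the nearest integer and clamp to the 8-bit range."""
    return max(0, min(255, int(round(v))))


def contrast_normalization(img, alpha):
    """Move grayscale values away from (alpha > 1) or closer to (alpha < 1) 128."""
    return [[clip(128 + alpha * (v - 128)) for v in row] for row in img]


def invert(img):
    """Set every grayscale value to 255 minus its original value."""
    return [[255 - v for v in row] for row in img]


def salt_and_pepper(img, prob, rng):
    """Replace each grayscale pixel with 0 or 255 with probability prob."""
    return [[rng.choice((0, 255)) if rng.random() < prob else v for v in row]
            for row in img]


def grayscale_mix(rgb_img, weight):
    """Blend each (r, g, b) pixel with its gray value; weight=1 is fully gray."""
    out = []
    for row in rgb_img:
        new_row = []
        for r, g, b in row:
            gray = 0.299 * r + 0.587 * g + 0.114 * b  # assumed luma weights
            new_row.append(tuple(clip((1 - weight) * c + weight * gray)
                                 for c in (r, g, b)))
        out.append(new_row)
    return out


def shuffle_channels(rgb_img, rng):
    """Apply one random permutation of the color channels to every pixel."""
    perm = [0, 1, 2]
    rng.shuffle(perm)
    return [[tuple(px[i] for i in perm) for px in row] for row in rgb_img]
```

Blur and shadow insertion need neighborhood or shape operations and are omitted from this sketch.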
- [40] Performance Metrics The two metrics used in our experiments are Matching Score and Homography Estimation, which are identical to those used in [9]. In this section we introduce their definitions. 4.1. Matching Score Matching Score measures the overall performance of the interest point detector and descriptor. It measures the ratio of ground truth correspondences t...
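Because the definition above is cut off, the following is only a simplified sketch of a matching-score style metric, not the paper's exact formula. It assumes one-directional nearest-neighbor descriptor matching, a ground-truth homography `H` from image A to image B, and a hypothetical pixel threshold `eps`; it also ignores any restriction to the region visible in both images.

```python
import math


def warp_point(H, x, y):
    """Map (x, y) through the 3x3 homography H."""
    w = H[2][0] * x + H[2][1] * y + H[2][2]
    return ((H[0][0] * x + H[0][1] * y + H[0][2]) / w,
            (H[1][0] * x + H[1][1] * y + H[1][2]) / w)


def matching_score(kps_a, desc_a, kps_b, desc_b, H, eps=3.0):
    """Fraction of points in A whose nearest descriptor in B is geometrically correct."""
    if not kps_a or not kps_b:
        return 0.0
    correct = 0
    for (xa, ya), da in zip(kps_a, desc_a):
        # Nearest neighbor in descriptor space (Euclidean distance).
        j = min(range(len(desc_b)), key=lambda k: math.dist(da, desc_b[k]))
        # The match counts if the warped point lands within eps pixels of it.
        if math.dist(warp_point(H, xa, ya), kps_b[j]) <= eps:
            correct += 1
    return correct / len(kps_a)
```

A score of 1.0 means every detected point was matched to its true counterpart, so the metric penalizes both unrepeatable detections and ambiguous descriptors.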
- [41] More Experimental Results In the main text we demonstrated results on image size 640 × 480. In this subsection we give results of different methods on image size 320 × 240. All experimental configurations are the same as those for image size 640 × 480 except the maximum number of extracted interest points. We keep no more than 300 interest points for 320 × 240 images....