Recognition: 2 Lean theorem links
Information theoretic underpinning of self-supervised learning by clustering
Pith reviewed 2026-05-13 06:14 UTC · model grok-4.3
The pith
Self-supervised learning by clustering emerges from KL-divergence minimization with a teacher-distribution constraint.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By analogy to supervised learning, SSL is formulated as KL-divergence optimization. Mode collapse is prevented by imposing an optimization constraint on the teacher distribution, which leads to normalization by inverse cluster priors. Via Jensen's inequality, this normalization simplifies to the popular batch-centering procedure. The theoretical model supports specific existing successful SSL methods and suggests directions for future investigation.
What carries the argument
KL-divergence minimization between student predictions and a teacher distribution whose normalization is fixed by inverse cluster priors.
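The chain from inverse-prior normalization to batch centering can be checked numerically. The sketch below is illustrative, not the paper's implementation: batch size, cluster count, and random logits are stand-ins, and it assumes the standard softmax parameterization of the teacher. The key observation is that the batch average of log-probabilities equals the column mean of the logits up to a per-sample constant, so replacing the log-prior with that average (the Jensen step) is exactly batch centering.

```python
import numpy as np

rng = np.random.default_rng(0)
B, K = 256, 16                      # batch size, number of clusters (illustrative)
logits = rng.normal(size=(B, K))    # stand-in teacher logits for one batch

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

P = softmax(logits)                 # teacher probabilities P(y|z_i)

# Exact inverse-prior normalization: Q(y|z) proportional to P(y|z) / q(y),
# with the cluster prior q(y) estimated as the batch average of P(y|z_i).
prior = P.mean(axis=0)
Q_exact = softmax(logits - np.log(prior))

# Jensen relaxation: log q(y) = log(mean_i P(y|z_i)) >= mean_i log P(y|z_i).
# mean_i log P(y|z_i) equals the column mean of the logits minus a
# per-sample constant that the softmax absorbs, so subtracting it is
# precisely batch centering of the logits.
Q_centered = softmax(logits - logits.mean(axis=0))

# The two teachers agree up to the Jensen gap.
gap = np.abs(Q_exact - Q_centered).max()
print(f"max deviation between exact and centered teacher: {gap:.4f}")
```

On random logits the gap is small but nonzero, which matches the paper's framing: centering is a Jensen simplification of inverse-prior normalization, not an identity.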
If this is right
- Distillation and centering shift from heuristics to consequences of the constrained KL objective.
- Existing clustering-based SSL algorithms receive a common information-theoretic justification.
- New SSL procedures can be obtained by varying the form of the teacher constraint while preserving the KL structure.
- The same framework supplies a route to analyze why certain normalizations succeed or fail in practice.
Where Pith is reading between the lines
- The KL-plus-constraint view could be tested on contrastive or reconstruction-based SSL to see whether analogous teacher constraints emerge.
- Relaxing the inverse-prior requirement might reveal whether centering remains necessary or can be replaced by other normalizers.
- Information-theoretic bounds derived from the same objective could quantify how much supervision is implicitly provided by the clustering signal.
Load-bearing premise
That the required constraint on the teacher distribution takes precisely the form of inverse cluster priors, which both blocks collapse and allows Jensen's inequality to recover batch centering.
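Spelled out in symbols (the notation and the direction of the KL are assumptions here, chosen to match the abstract's expression Q(y|z) = P(y|z)/Q(y)), the premise is:

```latex
\min_{\theta}\;\mathbb{E}_{z}\,
D_{\mathrm{KL}}\!\bigl(Q(\cdot\mid z)\,\big\|\,P_{\theta}(\cdot\mid z)\bigr),
\qquad
Q(y\mid z)\;\propto\;\frac{P(y\mid z)}{q(y)},
\qquad
q(y)\;=\;\mathbb{E}_{z}\,P(y\mid z),
```

with the centering step following from Jensen's inequality applied to the log-prior:

```latex
\log q(y)\;=\;\log \mathbb{E}_{z}\,P(y\mid z)\;\ge\;\mathbb{E}_{z}\,\log P(y\mid z).
```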
What would settle it
An explicit calculation or numerical check demonstrating that the constrained KL objective does not reduce to batch centering after applying Jensen's inequality, or an implementation in which the inverse-prior normalization fails to prevent collapse while centering still succeeds.
read the original abstract
Self-supervised learning (SSL) is recognized as an essential tool for building foundation models for Artificial Intelligence applications. The advances in SSL have been made thanks to vigorous arguments about the principles of SSL and through extensive empirical research. The aim of this paper is to contribute to the development of the underpinning theory of SSL, focusing on the deep clustering approach. By analogy to supervised learning, we formulate SSL as K-L divergence optimization. The mode collapse is prevented by imposing an optimisation constraint on the teacher distribution. This leads to normalization using inverse cluster priors. We show that using Jensen inequality this normalization simplifies to the popular batch centering procedure. Distillation and centering are common heuristics-based practices in SSL, but our work underpins them theoretically. The theoretical model developed not only supports specific existing successful SSL methods, but also suggests directions for future investigations.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript formulates self-supervised learning (SSL) by clustering as minimization of the Kullback-Leibler (KL) divergence between student and teacher distributions, by analogy to supervised learning. Mode collapse is prevented via an optimization constraint on the teacher distribution that normalizes using inverse cluster priors; Jensen's inequality is then applied to show that this normalization reduces to the standard batch-centering procedure. The work claims this supplies a theoretical underpinning for distillation and centering heuristics used in existing SSL methods.
Significance. If the constraint on the teacher distribution can be shown to arise necessarily from the KL objective and collapse-prevention requirement rather than being selected to recover centering, the result would provide a principled information-theoretic justification for widely used SSL practices and could guide the design of new algorithms. The paper correctly highlights the role of normalization in avoiding collapse and connects it to an existing heuristic, but the overall significance hinges on resolving the independence of the constraint derivation.
major comments (1)
- [Abstract and derivation of teacher-distribution constraint] Abstract and main derivation: The optimization constraint on the teacher distribution is introduced as normalization by inverse cluster priors without an independent derivation showing why this specific form is the minimal or natural choice that both prevents mode collapse and remains compatible with the KL objective. The subsequent application of Jensen's inequality then recovers batch centering, which raises the possibility that the constraint was chosen precisely because it produces the known result. This step is load-bearing for the central claim of providing a 'theoretical underpinning.'
Simulated Author's Rebuttal
We thank the referee for their insightful comments on our work. We address the major comment regarding the derivation of the teacher distribution constraint in detail below. We believe our response clarifies the motivation and we propose revisions to enhance the presentation.
read point-by-point responses
-
Referee: [Abstract and derivation of teacher-distribution constraint] Abstract and main derivation: The optimization constraint on the teacher distribution is introduced as normalization by inverse cluster priors without an independent derivation showing why this specific form is the minimal or natural choice that both prevents mode collapse and remains compatible with the KL objective. The subsequent application of Jensen's inequality then recovers batch centering, which raises the possibility that the constraint was chosen precisely because it produces the known result. This step is load-bearing for the central claim of providing a 'theoretical underpinning.'
Authors: We agree with the referee that a clearer independent motivation for the specific form of the constraint would strengthen the paper. In the revised manuscript we will expand the derivation section to show that the constraint arises from requiring the teacher distribution to have uniform marginal probabilities, which prevents mode collapse in the KL minimization. This leads naturally to normalization by the inverse of the cluster priors (estimated from the batch), since it ensures the expected marginals under the student are balanced. The constraint is not chosen to recover centering; it is the minimal constraint that preserves the probabilistic interpretation while ruling out trivial solutions. The subsequent use of Jensen's inequality then shows that it is equivalent to batch centering, thereby providing the theoretical link. We will also discuss potential alternative constraints and why this one is natural.
revision: partial
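The rebuttal's balance claim can be made concrete. Before per-sample renormalization, the inverse-prior teacher has uniform expected cluster marginals under the batch-estimated prior (a sketch, using the notation from the abstract):

```latex
\mathbb{E}_{z}\!\left[\frac{P(y\mid z)}{q(y)}\right]
\;=\;\frac{\mathbb{E}_{z}\,P(y\mid z)}{q(y)}
\;=\;\frac{q(y)}{q(y)}\;=\;1
\qquad\text{for every } y,
```

so no single cluster can absorb all the mass, which is the collapse-prevention role claimed for the constraint; per-sample renormalization makes this balance only approximate.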
Circularity Check
Teacher-distribution constraint introduced so as to recover batch centering via Jensen's inequality
specific steps
-
self-definitional
[Abstract]
"The mode collapse is prevented by imposing an optimisation constraint on the teacher distribution. This leads to normalization using inverse cluster priors. We show that using Jensen inequality this normalization simplifies to the popular batch centering procedure."
The constraint is defined such that its normalization form (inverse cluster priors) is the one that, under Jensen, yields batch centering. The reduction to the known heuristic therefore holds by the choice of constraint rather than as a necessary consequence of the KL formulation alone; a different anti-collapse constraint would not recover centering.
full rationale
The paper formulates SSL clustering as KL-divergence minimization between student and teacher distributions. It then states that mode collapse is prevented by imposing an optimization constraint on the teacher distribution, which directly leads to normalization by inverse cluster priors; Jensen's inequality is applied to show this equals batch centering. The specific constraint form is not derived as the unique or minimal anti-collapse requirement from the KL objective; instead it is presented because it produces the known centering heuristic. This makes the 'theoretical underpinning' reduce to a post-hoc choice whose output matches the target procedure by construction. No independent derivation or external validation of the constraint is supplied.
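The circularity point can be illustrated numerically: a different, equally legitimate anti-collapse constraint need not recover centering. The sketch below (illustrative sizes and random logits, not the paper's experiment) compares the batch-centered teacher with a Sinkhorn-style projection onto uniform cluster marginals, the equal-partition constraint used in SwAV. Both rule out the trivial single-cluster solution, yet they produce different teacher distributions.

```python
import numpy as np

rng = np.random.default_rng(1)
B, K = 256, 16                      # batch size, number of clusters (illustrative)
logits = rng.normal(size=(B, K))

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

# Anti-collapse constraint A: batch centering (the paper's result).
Q_centered = softmax(logits - logits.mean(axis=0))

# Anti-collapse constraint B: Sinkhorn-style alternating normalization,
# projecting the teacher toward uniform cluster marginals.
Q_sinkhorn = softmax(logits)
for _ in range(50):
    Q_sinkhorn /= Q_sinkhorn.sum(axis=0, keepdims=True)  # equalize cluster mass
    Q_sinkhorn /= Q_sinkhorn.sum(axis=1, keepdims=True)  # rows stay distributions

# Both prevent one cluster from absorbing the batch, but the resulting
# teachers differ, so "anti-collapse constraint" alone does not pin down
# the inverse-prior form that yields centering.
diff = np.abs(Q_centered - Q_sinkhorn).max()
print(f"max difference between the two teachers: {diff:.4f}")
```

This is the shape of the experiment the referee asks for: the choice of constraint, not the KL objective alone, determines whether centering appears.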
Axiom & Free-Parameter Ledger
free parameters (1)
- inverse cluster priors
axioms (1)
- domain assumption: SSL by clustering can be formulated as KL-divergence optimization by direct analogy to supervised learning
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
"formulate SSL as K-L divergence optimization... normalisation using inverse cluster priors... Jensen inequality this normalization simplifies to the popular batch centering procedure"
-
IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (unclear)
Relation between the paper passage and the cited Recognition theorem is unclear.
"regularised K-L divergence... Q(y|z) = P(y|z)/Q(y)"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.