A Theoretical Analysis of Contrastive Unsupervised Representation Learning

Sanjeev Arora , Hrishikesh Khandeparkar , Mikhail Khodak , Orestis Plevrakis , Nikunj Saunshi

Authors on Pith no claims yet

classification 💻 cs.LG cs.AIstat.ML

keywords representationslatentsimilaraverageclassesclassificationcontrastivedata

read the original abstract

Recent empirical works have successfully used unlabeled data to learn feature representations that are broadly useful in downstream classification tasks. Several of these methods are reminiscent of the well-known word2vec embedding algorithm: leveraging availability of pairs of semantically "similar" data points and "negative samples," the learner forces the inner product of representations of similar pairs with each other to be higher on average than with negative samples. The current paper uses the term contrastive learning for such algorithms and presents a theoretical framework for analyzing them by introducing latent classes and hypothesizing that semantically similar points are sampled from the same latent class. This framework allows us to show provable guarantees on the performance of the learned representations on the average classification task that is comprised of a subset of the same set of latent classes. Our generalization bound also shows that learned representations can reduce (labeled) sample complexity on downstream tasks. We conduct controlled experiments in both the text and image domains to support the theory.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Optimal Representations for Generalized Contrastive Learning with Imbalanced Datasets
cs.LG 2026-05 unverdicted novelty 7.0

In generalized contrastive learning with imbalanced classes, optimal representations collapse to class means whose angular geometry is determined by class proportions via convex optimization, and extreme imbalance cau...
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
The Platonic Representation Hypothesis
cs.LG 2024-05 unverdicted novelty 5.0

Representations learned by large AI models are converging toward a shared statistical model of reality.