Statistical Consistency and Generalization of Contrastive Representation Learning

Tianbao Yang; Xiyuan Wei; Yiming Ying; Yuanfan Li

arxiv: 2605.02116 · v2 · pith:4KXYZZYOnew · submitted 2026-05-04 · 💻 cs.LG

Yuanfan Li, Xiyuan Wei, Tianbao Yang, Yiming Ying

Pith reviewed 2026-05-21 08:54 UTC · model grok-4.3

classification 💻 cs.LG

keywords contrastive representation learning · statistical consistency · generalization bounds · retrieval ranking · AUC criterion · calibration inequality

The pith

The contrastive loss is statistically consistent with optimal ranking for retrieval tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper develops a unified statistical learning theory for contrastive representation learning. It proves that minimizing the contrastive loss produces optimal ranking under an AUC-type population criterion for retrieval quality. A calibration-style inequality is established to connect excess contrastive risk directly to excess retrieval suboptimality. Generalization bounds of order O(1/m + 1/sqrt(n)) for supervised and O(1/sqrt(m) + 1/sqrt(n)) for self-supervised cases are derived, which remain stable or improve as the number of negative samples m grows. These results explain the practical gains from large negative sets and reveal an explicit trade-off between m and the number of anchor points n.

Core claim

The contrastive loss is statistically consistent with optimal ranking and a calibration-style inequality quantitatively relates excess contrastive risk to excess retrieval suboptimality. Generalization bounds of order O(1/m + 1/sqrt(n)) and O(1/sqrt(m) + 1/sqrt(n)) are derived for supervised and self-supervised contrastive objectives, where m is the number of negative samples and n the number of anchor points.
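
To make the role of m concrete, here is a minimal sketch of a per-anchor contrastive loss with m negatives. The exact objective analyzed in the paper is not reproduced on this page, so the InfoNCE-style form below, the averaging over negatives, the function name `contrastive_loss`, and the default temperature `tau` are all assumptions made for illustration.

```python
import numpy as np

def contrastive_loss(pos_score, neg_scores, tau=0.1):
    """InfoNCE-style loss for one anchor (assumed form, not taken from the paper).

    pos_score:  similarity between the anchor and its positive.
    neg_scores: similarities between the anchor and its m negatives.
    tau:        temperature (hypothetical default).
    """
    neg_scores = np.asarray(neg_scores, dtype=float)
    # Margin of each negative over the positive, scaled by the temperature.
    margins = (neg_scores - pos_score) / tau
    # log of the mean of exp(margins), computed stably via log-sum-exp.
    log_mean_exp = np.logaddexp.reduce(margins) - np.log(len(margins))
    # log(1 + mean_j exp(margin_j)).
    return float(np.logaddexp(0.0, log_mean_exp))
```

Under this assumed form the loss is small when the positive outscores every negative and grows as negatives overtake it, which is the ranking behaviour the consistency claim is about.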

What carries the argument

The calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality under an AUC-type population criterion.
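
For orientation, the inequality can be written in the form quoted in the paper passage further down this page. The reading of the symbols is an assumption here: $\mathcal{E}(s)$ is taken to be the AUC-type retrieval criterion of a scoring function $s$ with optimum $\mathcal{E}^*$, and $L(s)$ the population contrastive risk with infimum $L^*$. Constants are suppressed by $\lesssim$.

```latex
\[
  \mathcal{E}^* - \mathcal{E}(s) \;\lesssim\; \sqrt{\,L(s) - L^*\,}
\]
% Consequence (consistency): for any sequence of scoring functions s_k,
\[
  L(s_k) \to L^* \quad\Longrightarrow\quad \mathcal{E}(s_k) \to \mathcal{E}^* .
\]
```

Read this way, the square root means that an excess contrastive risk of $\varepsilon$ only guarantees retrieval suboptimality of order $\sqrt{\varepsilon}$.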

If this is right

Contrastive representations achieve optimal retrieval performance in the large-sample limit.
Increasing the number of negative samples does not degrade and can improve generalization bounds.
An explicit trade-off exists between the number of negative samples m and anchor points n for achieving target generalization.
The theory applies uniformly to both supervised and self-supervised contrastive training.
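
The trade-off between m and n can be illustrated with a few lines of arithmetic. This is a sketch that drops all constants and logarithmic factors, which the rates quoted on this page do not specify; the function names are hypothetical.

```python
import math

def supervised_rate(m, n):
    """Order of the supervised bound, 1/m + 1/sqrt(n), constants dropped."""
    return 1.0 / m + 1.0 / math.sqrt(n)

def self_supervised_rate(m, n):
    """Order of the self-supervised bound, 1/sqrt(m) + 1/sqrt(n), constants dropped."""
    return 1.0 / math.sqrt(m) + 1.0 / math.sqrt(n)

def balancing_m(n, supervised=True):
    """Smallest m at which the m-term no longer exceeds the n-term.

    Supervised:      1/m <= 1/sqrt(n)        iff  m >= sqrt(n)
    Self-supervised: 1/sqrt(m) <= 1/sqrt(n)  iff  m >= n
    """
    if not supervised:
        return n
    root = math.isqrt(n)
    return root if root * root == n else root + 1
```

With n = 10,000 anchors and constants ignored, the m-term stops dominating at m = 100 in the supervised case and at m = 10,000 in the self-supervised case. These are the two reference curves, m = √n and m = n, that the Figure 1(b) caption compares against.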

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The consistency result could be used to design new contrastive objectives that target other retrieval metrics beyond AUC.
Practitioners might balance m and n according to the derived trade-off to optimize training under fixed compute.
The calibration inequality suggests a path to transfer consistency guarantees to other downstream tasks that can be cast as ranking problems.

Load-bearing premise

The minimizer of the population contrastive risk corresponds to the optimal retrieval ranking under the chosen AUC-type criterion.
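
As a point of reference for what an AUC-type criterion measures, the sketch below computes the empirical fraction of (positive, negative) pairs that a scoring function ranks correctly for one query. The paper's exact population criterion is not given on this page; this pairwise form, the tie handling, and the name `pairwise_auc` are assumptions.

```python
import numpy as np

def pairwise_auc(pos_scores, neg_scores):
    """Fraction of (positive, negative) pairs with the positive scored higher.

    Ties count as half a correct pair (a common convention, assumed here).
    """
    pos = np.asarray(pos_scores, dtype=float)[:, None]  # shape (P, 1)
    neg = np.asarray(neg_scores, dtype=float)[None, :]  # shape (1, N)
    correct = (pos > neg).astype(float) + 0.5 * (pos == neg)
    return float(correct.mean())
```

A value of 1.0 means every relevant item is ranked above every irrelevant one; 0.5 is what a constant scoring function achieves.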

What would settle it

A counterexample data distribution where the contrastive loss minimizer fails to achieve optimal ranking according to the AUC criterion, or empirical observation that generalization error increases with larger m.

Figures

Figures reproduced from arXiv: 2605.02116 by Tianbao Yang, Xiyuan Wei, Yiming Ying, Yuanfan Li.

**Figure 1.** (a) Zero-shot classification (left) and retrieval (right) results of CLIP training with different numbers of negative samples. n denotes the size of the anchor dataset, while m denotes the number of negative samples. (b) Critical size of m at different n, compared with m = √n and m = n.

**Figure 2.** Zero-shot retrieval results on MSCOCO (left) and Flickr (right) of CLIP training with different numbers of negative samples. n denotes the size of the anchor dataset, while m denotes the number of negative samples.

Original abstract

Contrastive representation learning (CRL) underpins many modern foundation models. Despite recent theoretical progress, existing analyses suffer from several key limitations: (i) the statistical consistency of CRL remains poorly understood; (ii) available generalization bounds deteriorate as the number of negative samples increases, contradicting the empirical benefits of large negative sets; and (iii) the retrieval performance of CRL has received limited theoretical attention. In this paper, we develop a unified statistical learning theory for CRL. For downstream tasks, we evaluate retrieval quality using an AUC-type population criterion and show that the contrastive loss is statistically consistent with optimal ranking. We further establish a calibration-style inequality that quantitatively relates excess contrastive risk to excess retrieval suboptimality. For upstream training, we study both supervised and self-supervised contrastive objectives and derive generalization bounds of order $O(1/m + 1/\sqrt{n})$ and $O(1/\sqrt{m} + 1/\sqrt{n})$, respectively, where $m$ denotes the number of negative samples and $n$ the number of anchor points. These bounds not only explain the empirical advantages of large negative sets but also reveal an explicit trade-off between $m$ and $n$. Extensive experiments on large-scale vision-language models corroborate our theoretical predictions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper gives generalization bounds for contrastive learning that improve with more negatives and links excess risk to retrieval suboptimality via a calibration inequality.

The main thing to know is that this work derives generalization bounds whose m-dependence improves rather than worsens, plus a calibration-style inequality that relates excess contrastive risk to excess retrieval suboptimality under an AUC-type criterion. That directly targets the mismatch between prior theory and the observed gains from large negative sets in practice. They also claim statistical consistency of the contrastive loss with optimal ranking for downstream retrieval. The supervised bound is O(1/m + 1/sqrt(n)) and the self-supervised one is O(1/sqrt(m) + 1/sqrt(n)), with experiments on vision-language models to check the predictions.

This is the concrete advance over the limitations they attribute to earlier analyses. The calibration inequality is a useful bridge between upstream loss and downstream quality that prior work had not made explicit. The assumptions are standard i.i.d. sampling and boundedness, and the claims are presented as following from ordinary statistical learning arguments rather than circular self-reference.

The soft spot is the generalization analysis itself. The stress-test concern about Rademacher complexity or uniform convergence terms growing with m is reasonable to check; if the proof relies on Lipschitz constants or bounded differences that stay independent of m, the rate holds, but any hidden accumulation would revert the bound to something slower and weaken the explanation for scaling negatives. The population correspondence between contrastive minimizer and optimal retrieval ranking is plausible under their loss but could be sensitive to distribution mismatch in real data.

This is for theorists and practitioners working on representation learning and scaling of foundation models. A reader who wants quantitative justification for why more negatives help, or who needs a link from training risk to retrieval AUC, will find usable pieces here.

It has enough specific new results and grounding to deserve a serious referee, even if the proofs require close scrutiny on the m-control step. I would recommend sending it to peer review.

Referee Report

1 major / 2 minor

Summary. The manuscript develops a unified statistical learning theory for contrastive representation learning (CRL). It shows that the contrastive loss is statistically consistent with optimal ranking under an AUC-type population criterion for retrieval, derives a calibration-style inequality relating excess contrastive risk to excess retrieval suboptimality, and provides generalization bounds of order O(1/m + 1/sqrt(n)) for supervised contrastive objectives and O(1/sqrt(m) + 1/sqrt(n)) for self-supervised objectives (m = number of negative samples, n = number of anchors). These results are supported by experiments on large-scale vision-language models.

Significance. If the derivations are correct, the work is significant because it supplies the first explicit explanation for why increasing the number of negatives improves CRL performance, resolving a contradiction with prior bounds that deteriorate in m. The consistency and calibration results address open questions about downstream retrieval quality. The m-n trade-off is practically useful. Credit is due for producing bounds that align with empirical practice and for including corroborating large-scale experiments.

major comments (1)

[§4 (Generalization analysis)] The O(1/m + 1/sqrt(n)) bound for the supervised case requires that the empirical-process deviation term for the m-negative contrastive loss does not grow with m. Standard symmetrization or chaining arguments produce Rademacher complexity that can scale as sqrt(m) or worse unless the proof explicitly invokes bounded differences, m-independent Lipschitz constants, or a covering-number bound that decouples the negatives. The manuscript must show the precise control used; absent this, the claimed rate reverts and the explanation for large-m gains no longer follows.

minor comments (2)

[Abstract and §2] The abstract and §2 should explicitly distinguish the supervised and self-supervised objectives when stating the two different rates.
[§3] Add a short remark on how the AUC-type retrieval criterion is chosen and why it is the appropriate population target for the consistency claim.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the detailed and constructive report. The single major comment raises a valid point about the technical control needed to ensure the generalization deviation term does not grow with m in the supervised bound. We address this directly below and will revise the manuscript to make the argument fully explicit.

Referee: The O(1/m + 1/sqrt(n)) bound for the supervised case requires that the empirical-process deviation term for the m-negative contrastive loss does not grow with m. Standard symmetrization or chaining arguments produce Rademacher complexity that can scale as sqrt(m) or worse unless the proof explicitly invokes bounded differences, m-independent Lipschitz constants, or a covering-number bound that decouples the negatives. The manuscript must show the precise control used; absent this, the claimed rate reverts and the explanation for large-m gains no longer follows.

Authors: We agree that standard symmetrization would typically produce an undesirable sqrt(m) factor. Our proof of Theorem 4.1 (Appendix B) avoids this by applying McDiarmid's bounded-differences inequality directly to the per-anchor contrastive loss. Because the loss is an average over the m negatives and each term is bounded in [0,1], changing any single negative alters the loss by at most 2/m. The resulting concentration inequality therefore contributes an additive O(1/m) term (after union bound over n anchors) rather than a term that grows with m. The 1/sqrt(n) term arises from the usual empirical-process deviation over the n anchors. We will insert a short clarifying paragraph at the beginning of Section 4 and add an explicit remark in Appendix B that highlights this bounded-difference control and why it decouples the deviation from m.

revision: yes
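
For readers who want the bounded-differences step spelled out, the following is a sketch of the textbook McDiarmid calculation for a plain average. It assumes the m negatives for a fixed anchor are independent and uses the per-coordinate constant 2/m stated in the rebuttal; the symbols f, g, and z_j are introduced here for illustration and are not the paper's notation.

```latex
% Assumed setup: f averages m bounded terms over independent negatives z_1, ..., z_m.
\[
  f(z_1,\dots,z_m) = \frac{1}{m}\sum_{j=1}^{m} g(z_j),
  \qquad
  \bigl| f(\dots,z_j,\dots) - f(\dots,z_j',\dots) \bigr| \le c_j = \frac{2}{m}.
\]
% McDiarmid's inequality with \sum_j c_j^2 = 4/m:
\[
  \Pr\bigl( |f - \mathbb{E} f| \ge t \bigr)
  \le 2\exp\!\Bigl( -\frac{2t^2}{\sum_{j=1}^{m} c_j^2} \Bigr)
  = 2\exp\!\Bigl( -\frac{m t^2}{2} \Bigr).
\]
% Equivalently, with probability at least 1 - \delta:
\[
  |f - \mathbb{E} f| \le \sqrt{\frac{2\log(2/\delta)}{m}} .
\]
```

On its own this calculation gives a per-anchor fluctuation that shrinks with m at the order 1/sqrt(m). Any further argument needed to reach the 1/m term of the supervised bound is in the paper's Appendix B and is not reproduced on this page.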

Circularity Check

0 steps flagged

No significant circularity; derivations use standard statistical learning arguments

full rationale

The paper derives statistical consistency of the contrastive loss with optimal AUC-type ranking and generalization bounds O(1/m + 1/sqrt(n)) and O(1/sqrt(m) + 1/sqrt(n)) from population risk minimization, calibration inequalities, and empirical process tools under i.i.d. sampling and boundedness assumptions. These steps do not reduce by construction to fitted parameters, self-citations, or renamed inputs; the central claims remain independent of the target results and rest on external statistical machinery rather than tautological redefinitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The analysis rests on standard statistical learning assumptions such as i.i.d. sampling of anchors and negatives, existence of a well-defined population risk, and sufficient regularity for the contrastive loss to admit generalization bounds; no new free parameters or invented entities are introduced in the abstract.

axioms (2)

domain assumption: Training examples are drawn i.i.d. from an underlying data distribution.
Required for the standard i.i.d. generalization arguments on which the bounds rest.
domain assumption: The contrastive loss admits a population minimizer that corresponds to optimal retrieval under the AUC criterion.
Central to the consistency and calibration claims.

pith-pipeline@v0.9.0 · 5769 in / 1510 out tokens · 57723 ms · 2026-05-21T08:54:29.247798+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

We establish statistical consistency ... E* - E(s) ≲ sqrt(L(s) - L*) ... generalization bounds of order O(1/m + 1/sqrt(n))

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.



work page 1981

[25] [25]

A. L. Samuel. Some Studies in Machine Learning Using the Game of Checkers. IBM Journal of Research and Development. 1959

work page 1959

[26] [26]

Scaling Learning Algorithms Towards

Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

work page

[27] [27]

and Osindero, Simon and Teh, Yee Whye , journal =

Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

work page

[28] [28]

2016 , publisher=

Deep learning , author=. 2016 , publisher=

work page 2016

[29] [29]

Journal of machine learning research , volume=

Stability and generalization , author=. Journal of machine learning research , volume=

work page

[30] [30]

2024 , eprint=

Generalization Analysis for Deep Contrastive Representation Learning , author=. 2024 , eprint=

work page 2024

[31] [31]

2021 , eprint=

Learning Bounds for Risk-sensitive Learning , author=. 2021 , eprint=

work page 2021

[32] [32]

2024 , eprint=

Understanding Transferable Representation Learning and Zero-shot Transfer in CLIP , author=. 2024 , eprint=

work page 2024

[33] [33]

2022 , eprint=

On the Surrogate Gap between Contrastive and Supervised Losses , author=. 2022 , eprint=

work page 2022

[34] [34]

2025 , eprint=

A Statistical Theory of Contrastive Pre-training and Multimodal Generative AI , author=. 2025 , eprint=

work page 2025

[35] [35]

Proceedings of the 38th International Conference on Machine Learning , pages =

Learning Transferable Visual Models From Natural Language Supervision , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

work page 2021

[36] [36]

Proceedings of the 36th International Conference on Machine Learning , pages =

A Theoretical Analysis of Contrastive Unsupervised Representation Learning , author =. Proceedings of the 36th International Conference on Machine Learning , pages =. 2019 , editor =

work page 2019

[37] [37]

International conference on machine learning , pages=

Understanding contrastive representation learning through alignment and uniformity on the hypersphere , author=. International conference on machine learning , pages=. 2020 , organization=

work page 2020

[38] [38]

Proceedings of the 39th International Conference on Machine Learning , pages =

Provable Stochastic Optimization for Global Contrastive Learning: Small Batch Does Not Harm Performance , author =. Proceedings of the 39th International Conference on Machine Learning , pages =. 2022 , editor =

work page 2022

[39] [39]

2014 , eprint=

On the Consistency of AUC Pairwise Optimization , author=. 2014 , eprint=

work page 2014

[40] [40]

2025 , eprint=

Self-Supervised Contrastive Learning is Approximately Supervised Contrastive Learning , author=. 2025 , eprint=

work page 2025

[41] [41]

2023 , eprint=

Understanding Contrastive Learning via Distributionally Robust Optimization , author=. 2023 , eprint=

work page 2023

[42] [42]

Generalization bounds for learning under graph-dependence: a survey , volume=

Zhang, Rui-Ray and Amini, Massih-Reza , year=. Generalization bounds for learning under graph-dependence: a survey , volume=. Machine Learning , publisher=. doi:10.1007/s10994-024-06536-9 , number=

work page doi:10.1007/s10994-024-06536-9

[43] [43]

2025 , eprint=

Optimizing What Matters: AUC-Driven Learning for Robust Neural Retrieval , author=. 2025 , eprint=

work page 2025

[44] [44]

2023 , organization=

AUC-CL: A Batchsize-Robust Framework for Self-Supervised Contrastive Representation Learning , author=. 2023 , organization=

work page 2023

[45] [45]

2023 , eprint=

Not All Semantics are Created Equal: Contrastive Self-supervised Learning with Automatic Temperature Individualization , author=. 2023 , eprint=

work page 2023

[46] [46]

Mathematical Finance , volume=

An old-new concept of convex risk measures: The optimized certainty equivalent , author=. Mathematical Finance , volume=. 2007 , publisher=

work page 2007

[47] [47]

2025 , eprint=

Qwen3 Technical Report , author=. 2025 , eprint=

work page 2025

[48] [48]

arXiv e-prints , pages=

The llama 3 herd of models , author=. arXiv e-prints , pages=

work page

[49] [49]

DeepSeek-R1: Incentivizing Reasoning Capability in LLMs via Reinforcement Learning

Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning , author=. arXiv preprint arXiv:2501.12948 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[50] [50]

2025 , eprint=

Weighted Point Set Embedding for Multimodal Contrastive Learning Toward Optimal Similarity Metric , author=. 2025 , eprint=

work page 2025

[51] [51]

The Twelfth International Conference on Learning Representations , year=

Data Filtering Networks , author=. The Twelfth International Conference on Learning Representations , year=

work page

[52] [52]

Transactions of the Association for Computational Linguistics , volume=

From image descriptions to visual denotations: New similarity metrics for semantic inference over event descriptions , author=. Transactions of the Association for Computational Linguistics , volume=. 2014 , publisher=

work page 2014

[53] [53]

Microsoft COCO Captions: Data Collection and Evaluation Server

Microsoft coco captions: Data collection and evaluation server , author=. arXiv preprint arXiv:1504.00325 , year=

work page internal anchor Pith review Pith/arXiv arXiv

[54] [54]

2018 , eprint=

Unsupervised Feature Learning via Non-Parametric Instance-level Discrimination , author=. 2018 , eprint=

work page 2018

[55] [55]

2017 , eprint=

Spectrally-normalized margin bounds for neural networks , author=. 2017 , eprint=

work page 2017

[56] [56]

2018 , publisher=

Foundations of machine learning , author=. 2018 , publisher=

work page 2018

[57] [57]

2013 , publisher=

Probability in Banach Spaces: isoperimetry and processes , author=. 2013 , publisher=

work page 2013

[58] [58]

Journal of Machine Learning Research , year =

Xin Zou and Weiwei Liu , title =. Journal of Machine Learning Research , year =

work page

[59] [59]

2020 , eprint=

PAC-Bayesian Contrastive Unsupervised Representation Learning , author=. 2020 , eprint=

work page 2020

[60] [60]

2025 , eprint=

A Generalization Theory for Zero-Shot Prediction , author=. 2025 , eprint=

work page 2025

[61] [61]

Advances in neural information processing systems , volume=

Provable guarantees for self-supervised deep learning with spectral contrastive loss , author=. Advances in neural information processing systems , volume=

work page

[62] [62]

Advances in Neural Information Processing Systems , volume=

Predicting what you already know helps: Provable self-supervised learning , author=. Advances in Neural Information Processing Systems , volume=

work page

[63] [63]

Proceedings of the 38th International Conference on Machine Learning , pages =

Understanding self-supervised learning dynamics without contrastive pairs , author =. Proceedings of the 38th International Conference on Machine Learning , pages =. 2021 , editor =

work page 2021

[64] [64]

2021 , eprint=

Self-supervised Learning from a Multi-view Perspective , author=. 2021 , eprint=

work page 2021

[65] [65]

2022 , eprint=

Chaos is a Ladder: A New Theoretical Understanding of Contrastive Learning via Augmentation Overlap , author=. 2022 , eprint=

work page 2022

[66] [66]

Statistics & probability letters , volume=

A note on margin-based loss functions in classification , author=. Statistics & probability letters , volume=. 2004 , publisher=

work page 2004

[67] [67]

2020 , eprint=

On the Consistency of Top-k Surrogate Losses , author=. 2020 , eprint=

work page 2020

[68] [68]

The Annals of Statistics , volume=

Statistical behavior and consistency of classification methods based on convex risk minimization , author=. The Annals of Statistics , volume=. 2004 , publisher=

work page 2004

[69] [69]

Proceedings of the 40th International Conference on Machine Learning , pages =

Label Distributionally Robust Losses for Multi-class Classification: Consistency, Robustness and Adaptivity , author =. Proceedings of the 40th International Conference on Machine Learning , pages =. 2023 , editor =

work page 2023

[70] [70]

Learning with Average Top-k Loss , url =

Fan, Yanbo and Lyu, Siwei and Ying, Yiming and Hu, Baogang , booktitle =. Learning with Average Top-k Loss , url =

work page

[71] [71]

Machine learning , volume=

Calibration and regret bounds for order-preserving surrogate losses in learning to rank , author=. Machine learning , volume=. 2013 , publisher=

work page 2013

[72] [72]

Advances in neural information processing systems , volume=

Flamingo: a visual language model for few-shot learning , author=. Advances in neural information processing systems , volume=

work page

[73] [73]

Hierarchical Text-Conditional Image Generation with CLIP Latents

Hierarchical text-conditional image generation with clip latents , author=. arXiv preprint arXiv:2204.06125 , volume=

work page internal anchor Pith review Pith/arXiv arXiv

[74] [74]

Neural computation , volume=

Deep clustering with a constraint for topological invariance based on symmetric infonce , author=. Neural computation , volume=. 2023 , publisher=

work page 2023

[75] [75]

2009 IEEE conference on computer vision and pattern recognition , pages=

Imagenet: A large-scale hierarchical image database , author=. 2009 IEEE conference on computer vision and pattern recognition , pages=. 2009 , organization=

work page 2009

[76] [76]

2019 , publisher=

High-dimensional statistics: A non-asymptotic viewpoint , author=. 2019 , publisher=

work page 2019

[77] [77]

International conference on machine learning , pages=

Scaling up visual and vision-language representation learning with noisy text supervision , author=. International conference on machine learning , pages=. 2021 , organization=

work page 2021

[78] [78]

, author=

Dense Passage Retrieval for Open-Domain Question Answering. , author=. EMNLP (1) , pages=

work page

[79] [79]

SIAM Journal on Optimization , volume=

Sample complexity of sample average approximation for conditional stochastic optimization , author=. SIAM Journal on Optimization , volume=. 2020 , publisher=

work page 2020

[80] [80]

arXiv preprint arXiv:2510.25983 , year=

Contrastive Predictive Coding Done Right for Mutual Information Estimation , author=. arXiv preprint arXiv:2510.25983 , year=

work page arXiv