arXiv: 2002.05709 · v3 · submitted 2020-02-13 · cs.LG · cs.CV · stat.ML

A Simple Framework for Contrastive Learning of Visual Representations

Ting Chen, Simon Kornblith, Mohammad Norouzi, Geoffrey Hinton

Pith reviewed 2026-05-13 18:28 UTC · model grok-4.3

classification cs.LG · cs.CV · stat.ML

keywords contrastive learning · self-supervised learning · visual representations · data augmentation · ImageNet · projection head · linear probe

The pith

A contrastive self-supervised framework learns ImageNet representations that match a supervised ResNet-50.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SimCLR as a streamlined contrastive learning approach for visual representations that dispenses with memory banks or specialized networks. It systematically tests framework components and finds that particular data-augmentation combinations, a learnable nonlinear projection head, and larger batch sizes together drive strong performance. When a linear classifier is trained on top of the resulting features, top-1 accuracy reaches 76.5 percent on ImageNet, a 7 percent relative gain over prior self-supervised methods and on par with fully supervised ResNet-50. Readers should care because the work shows that simple, scalable contrastive objectives can close most of the gap with supervised learning while using no labels for the representation stage.
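To make the "7 percent relative gain" concrete, the short calculation below backs out the prior-best accuracy that figure implies. It is plain arithmetic on the two numbers quoted above, not a value read from the paper's tables, and the function name is illustrative. Because the 7 percent figure is itself rounded, the result is approximate.

```python
def implied_baseline(new_acc: float, relative_gain: float) -> float:
    """Baseline accuracy implied by a relative improvement.

    new_acc = baseline * (1 + relative_gain), so
    baseline = new_acc / (1 + relative_gain).
    """
    return new_acc / (1.0 + relative_gain)


simclr_top1 = 76.5    # percent top-1, as quoted in the summary above
relative_gain = 0.07  # "7 percent relative gain", as quoted above

baseline = implied_baseline(simclr_top1, relative_gain)
absolute_gain = simclr_top1 - baseline

print(f"implied prior best: about {baseline:.1f}% top-1")
print(f"absolute gain: about {absolute_gain:.1f} points")
```

Under these inputs the relative gain corresponds to an absolute gain of roughly five percentage points, which is the more intuitive way to compare linear-probe results.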

Core claim

By treating two randomly augmented views of each image as a positive pair and the augmented views of all other images in the batch as negatives, SimCLR trains an encoder followed by a nonlinear projection head under a contrastive loss. The resulting representations reach 76.5 percent top-1 accuracy on ImageNet under linear evaluation, and 85.8 percent top-5 accuracy when the network is fine-tuned on only 1 percent of the labels.

What carries the argument

The SimCLR contrastive prediction task, which treats two augmented views of the same image as positives and uses a learnable nonlinear projection head to map representations into the space where the loss is applied.
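A minimal sketch of the contrastive objective described here, in plain Python so it runs without dependencies. The pairing convention (views of the same image stored at adjacent indices), the function names, and the default temperature are assumptions made for illustration; this is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def nt_xent(z, tau=0.5):
    """Mean normalized-temperature cross-entropy over 2N projected views.

    Assumption: z[2k] and z[2k + 1] are the two views of image k.
    tau=0.5 is an arbitrary illustrative default.
    """
    n = len(z)
    assert n >= 4 and n % 2 == 0, "need at least two images, two views each"
    total = 0.0
    for i in range(n):
        j = i ^ 1  # partner index: 0<->1, 2<->3, ...
        # Similarities to every other view; the anchor itself is excluded.
        logits = [cosine(z[i], z[k]) / tau for k in range(n) if k != i]
        positive = cosine(z[i], z[j]) / tau
        # log-sum-exp with max subtraction for numerical stability
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        total += log_denominator - positive
    return total / n


# Toy check: views that agree within each pair score a lower loss
# than views whose partners point in orthogonal directions.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
misaligned = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
print(nt_xent(aligned) < nt_xent(misaligned))
```

In a real pipeline the vectors in z would be outputs of the projection head, not the encoder representations used for the downstream linear probe.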

If this is right

Contrastive learning benefits more from very large batch sizes and longer training than supervised classification does.
Effective predictive tasks arise mainly from composing multiple data augmentations rather than from any single transform.
Inserting a nonlinear projection head between the representation and the contrastive loss measurably improves downstream linear-probe accuracy.
The same representations support strong semi-supervised fine-tuning when only 1 percent of ImageNet labels are available.
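The point about composing augmentations can be sketched on a toy image. The two transforms below (a random crop and a brightness shift) are simple stand-ins chosen for illustration, not the paper's augmentation policy, and every name here is hypothetical.

```python
import random


def random_crop(img, size, rng):
    """Take a random size x size window from a 2-D list of pixel values."""
    height, width = len(img), len(img[0])
    top = rng.randrange(height - size + 1)
    left = rng.randrange(width - size + 1)
    return [row[left:left + size] for row in img[top:top + size]]


def brightness_jitter(img, max_delta, rng):
    """Add one random offset to every pixel, clamped to [0, 1]."""
    delta = rng.uniform(-max_delta, max_delta)
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in img]


def compose(*transforms):
    """Chain transforms so each one sees the output of the previous one."""
    def apply(img, rng):
        for transform in transforms:
            img = transform(img, rng)
        return img
    return apply


augment = compose(
    lambda img, rng: random_crop(img, 4, rng),
    lambda img, rng: brightness_jitter(img, 0.2, rng),
)

rng = random.Random(0)
# 8 x 8 toy "image" with pixel values in [0, 1]
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]

# Two independent draws from the same composed policy form one positive pair.
view_a = augment(image, rng)
view_b = augment(image, rng)
print(len(view_a), len(view_a[0]))
```

Because each call draws fresh random parameters, the two views generally differ in both location and intensity, so matching them requires more than a single low-level cue.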

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The same augmentation-plus-projection recipe could be tested on video or audio to see whether the same batch-size and training-length trends hold outside static images.
If the method continues to improve with even larger batches and longer training, self-supervised pretraining might become the default first stage for most vision pipelines.
The framework's simplicity suggests it can serve as a reproducible baseline for studying how much further contrastive objectives can be pushed without architectural novelty.

Load-bearing premise

That the particular choices of data-augmentation composition and the nonlinear projection head are the main sources of the observed gains rather than interactions with untested optimizer or architecture details.

What would settle it

Retraining the identical encoder and loss with either a single augmentation policy or a linear projection head and observing that top-1 linear-probe accuracy falls below 70 percent on ImageNet would falsify the claimed importance of those two components.

Original abstract

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague

SimCLR shows that a stripped-down contrastive setup with the right augmentations and projection head can hit 76.5% linear top-1 on ImageNet, matching supervised ResNet-50. The main advance is the clean recipe that drops memory banks and fancy architectures while still beating prior self-supervised numbers by a clear margin. They back this with ablation tables that isolate two practical levers: the specific composition of data augmentations matters a lot, and inserting a learnable nonlinear projection head before the contrastive loss improves the final representations. Larger batches and longer training also help, which aligns with how contrastive losses work.

The numbers come from held-out ImageNet validation and are reported with enough detail to reproduce the main result. The semi-supervised fine-tuning numbers with 1% labels are a nice extra but not the core claim. The soft spot is that batch size, temperature, and other hyperparameters were tuned on the same ImageNet benchmark used for the final numbers, so some of the gain could be benchmark-specific rather than fully general. They stick mostly to ResNet-50 and ImageNet, so transfer to other architectures or datasets is not deeply tested. Still, the ablations are honest and the central performance claim holds up without circularity.

This paper is for anyone building self-supervised vision models or looking for a strong, simple baseline. The thinking is straightforward and the experiments are scoped tightly enough to be useful. I would cite the augmentation and projection-head findings in related work. Send it to peer review.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SimCLR, a simplified contrastive self-supervised learning framework for visual representations. It removes the need for specialized architectures or memory banks, relying instead on standard ResNet backbones, a composition of data augmentations, a learnable nonlinear projection head, and the NT-Xent contrastive loss. Systematic ablations isolate the contributions of augmentation composition, the projection head, and large batch sizes. The central empirical result is that a linear classifier trained on the learned representations achieves 76.5% top-1 accuracy on ImageNet, a 7% relative improvement over prior self-supervised methods and matching the performance of a supervised ResNet-50; semi-supervised fine-tuning with 1% labels reaches 85.8% top-5 accuracy.

Significance. If the results hold, this work is significant because it demonstrates that a simple contrastive recipe can match supervised baselines on a large-scale benchmark while supplying ablation evidence that clarifies the roles of data augmentation and the projection head. These insights have shaped subsequent representation-learning research. The manuscript supplies exact training protocols and ablation tables, supporting reproducibility of the headline numbers.

major comments (1)

[Experiments] Experiments section: batch size and temperature are tuned on the ImageNet validation set that is also used to report the final 76.5% top-1 accuracy. This creates a risk that the headline number reflects hyperparameter overfitting to the evaluation distribution rather than a robust improvement; a separate tuning split or cross-validation protocol would be needed to confirm the claim.

minor comments (2)

[Figure 2] The caption of the main framework figure would be clearer if it explicitly distinguished the representation vector h from the projected vector z and indicated where the contrastive loss is computed.
[§3.2] §3.2: while the NT-Xent loss is standard, writing its explicit normalized-temperature cross-entropy formula would improve self-contained readability.
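For reference, the formula that the second minor comment asks for can be written out as follows, for a positive pair (i, j) among the 2N augmented views of a batch of N images. The symbols follow the passage quoted in the Lean theorems section of this page; the explicit summation range and the indicator that excludes the anchor are written here as the usual form of this loss and should be checked against the manuscript.

```latex
\ell_{i,j} = -\log
  \frac{\exp\!\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]}\,
        \exp\!\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
```

Here z denotes the projected vector on which the loss is computed, as distinct from the representation h mentioned in the Figure 2 comment, and τ is the temperature.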

Simulated Authors' Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We address the single major comment below.

Referee: [Experiments] Experiments section: batch size and temperature are tuned on the ImageNet validation set that is also used to report the final 76.5% top-1 accuracy. This creates a risk that the headline number reflects hyperparameter overfitting to the evaluation distribution rather than a robust improvement; a separate tuning split or cross-validation protocol would be needed to confirm the claim.

Authors: We appreciate the referee's observation. Hyperparameters including batch size and temperature were indeed selected using the ImageNet validation set, following the standard protocol for linear evaluation on this benchmark (where test labels remain unavailable). However, the manuscript's primary claims rest on systematic ablations that isolate the effects of augmentation composition and the nonlinear projection head; these trends hold across wide ranges of batch sizes (256 to 8192) and temperatures (0.05 to 0.5) without requiring the final reported configuration. The headline 76.5% result is also consistent with the supervised ResNet-50 baseline under identical evaluation, which itself uses the same validation set. In the revised manuscript we will add an explicit paragraph in Section 4 clarifying the hyperparameter selection process, noting the common practice in the field, and stating that the core architectural and augmentation insights were validated independently of the final hyperparameter choice.

revision: yes

Circularity Check

0 steps flagged

No significant circularity detected

The manuscript is an empirical study introducing the SimCLR framework for contrastive self-supervised learning. All reported performance figures (e.g., 76.5% linear top-1 on ImageNet validation) are obtained by training the described recipe and measuring accuracy on held-out data against external baselines. No equations, predictions, or first-principles derivations appear that reduce to fitted inputs or self-citations by construction. Ablation tables isolate the effects of augmentations, projection head, and batch size without circular re-use of the target metric. Self-citations to prior contrastive work are not load-bearing for the central empirical claims.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on standard contrastive-learning assumptions plus a small number of fitted hyperparameters; no new physical entities are postulated.

free parameters (2)

temperature parameter in NT-Xent loss
Controls the sharpness of the contrastive distribution; chosen via validation performance.
batch size
Shown to improve performance when increased; selected as a large value (thousands) based on experiments.
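A small sketch of what these two parameters do. The similarity scores are made up for illustration; the temperature and batch-size endpoints are the ones named in the simulated rebuttal above. The negative count assumes each of N images contributes two views and that an anchor's negatives are all views other than itself and its positive.

```python
import math


def softmax(scores, tau):
    """Temperature-scaled softmax; smaller tau gives a sharper distribution."""
    scaled = [s / tau for s in scores]
    m = max(scaled)  # subtract the max for numerical stability
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def num_negatives(batch_size):
    """Negatives per anchor: 2N views minus the anchor and its positive."""
    return 2 * (batch_size - 1)


# Hypothetical cosine similarities: the positive first, then two negatives.
sims = [0.9, 0.5, 0.1]

for tau in (0.05, 0.5):
    probs = softmax(sims, tau)
    print(f"tau={tau}: probability on the positive = {probs[0]:.3f}")

for n in (256, 8192):
    print(f"batch size {n}: {num_negatives(n)} negatives per anchor")
```

At the low temperature nearly all probability mass sits on the highest-scoring candidate, while the higher temperature spreads it across candidates; larger batches increase the number of negatives each anchor is contrasted against.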

axioms (1)

domain assumption: Maximizing agreement between positive pairs while contrasting against negatives yields useful representations.
Core premise of the contrastive prediction task, stated in the framework description.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: ℓ_{i,j} = −log [exp(sim(z_i,z_j)/τ) / ∑ exp(sim(z_i,z_k)/τ)] with sim = cosine similarity

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear
Paper passage: linear classifier on frozen ResNet-50 representations yields 76.5% top-1

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning
cs.LG 2026-05 unverdicted novelty 7.0

SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude ...
Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
cs.LG 2026-04 unverdicted novelty 7.0

NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
Self-Directed Task Identification
cs.LG 2026-04 unverdicted novelty 7.0

SDTI lets models identify the correct target variable in datasets in a zero-shot setting using standard neural networks, beating baselines by 14% F1 on synthetic benchmarks.
BEiT: BERT Pre-Training of Image Transformers
cs.CV 2021-06 conditional novelty 7.0

BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.
Mastering Atari with Discrete World Models
cs.LG 2020-10 accept novelty 7.0

DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
q-bio.NC 2026-05 unverdicted novelty 6.0

Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.
CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking
cs.CV 2026-05 unverdicted novelty 6.0

CalibFree enables calibration-free multi-camera tracking via self-supervised feature separation through single-view distillation and cross-view reconstruction, reporting 3% higher accuracy and 7.5% better F1 on tested...
An Interpretable and Scalable Framework for Evaluating Large Language Models
stat.ML 2026-05 unverdicted novelty 6.0

A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
Velox: Learning Representations of 4D Geometry and Appearance
cs.CV 2026-05 unverdicted novelty 6.0

Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching
cs.CV 2026-04 unverdicted novelty 6.0

ShapeY is a benchmark dataset and nearest-neighbor protocol that measures shape-based recognition in vision models, revealing that even state-of-the-art networks fail to generalize consistently across 3D viewpoints an...
StarCLR: Contrastive Learning Representation for Astronomical Light Curves
astro-ph.SR 2026-04 conditional novelty 6.0

StarCLR pretrains on TESS light curves via contrastive learning on overlapping subsequences and improves variable star classification F1 scores over scratch-trained models when fine-tuned on TESS, ZTF, and Gaia.
Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models
q-bio.NC 2026-04 unverdicted novelty 6.0

Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.
Self-supervised Pretraining of Cell Segmentation Models
cs.CV 2026-04 unverdicted novelty 6.0

DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
Zero-shot World Models Are Developmentally Efficient Learners
cs.AI 2026-04 unverdicted novelty 6.0

A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
Masked Contrastive Pre-Training Improves Music Audio Key Detection
cs.SD 2026-04 unverdicted novelty 6.0

Masked contrastive pre-training on music audio yields representations that achieve SOTA key detection performance in the supervised setting without sophisticated augmentations.
Revisiting Feature Prediction for Learning Visual Representations from Video
cs.CV 2024-02 conditional novelty 6.0

V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model
cs.LG 2026-05 unverdicted novelty 5.0

A contrastive-learning ECG foundation model with multitask heads predicts post-MI outcomes better than training from scratch (AUC 0.794 vs 0.608).
Information theoretic underpinning of self-supervised learning by clustering
cs.LG 2026-05 unverdicted novelty 5.0

SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling
cs.LG 2026-04 unverdicted novelty 5.0

Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.
OC-Distill: Ontology-aware Contrastive Learning with Cross-Modal Distillation for ICU Risk Prediction
cs.LG 2026-04 unverdicted novelty 5.0

OC-Distill combines ontology-aware contrastive pretraining with cross-modal distillation to improve ICU risk prediction performance and label efficiency while using only vital signs at inference.
Improved Baselines with Momentum Contrastive Learning
cs.CV 2020-03 accept novelty 5.0

Adding an MLP projection head and enhanced augmentations to MoCo produces stronger unsupervised vision baselines that beat SimCLR while using smaller batches.
Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
eess.AS 2026-04 unverdicted novelty 4.0

Cosine similarity in SupCon with a delayed negative queue on wav2vec2 XLS-R yields the lowest equal error rates for deepfake audio detection on in-the-wild and pooled evaluations.
LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
cs.SE 2026-04 unverdicted novelty 4.0

LLMSniffer improves detection of LLM-generated code on GPTSniffer and Whodunit benchmarks by fine-tuning GraphCodeBERT via two-stage supervised contrastive learning plus preprocessing and MLP classification.
Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization
cs.CV 2026-04 unverdicted novelty 4.0

DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.
LLMs Struggle with Abstract Meaning Comprehension More Than Expected
cs.CL 2026-04 unverdicted novelty 3.0

LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.

Reference graph

Works this paper leans on

61 extracted references · 61 canonical work pages · cited by 25 Pith papers · 8 internal anchors

[1]

M., Rupprecht, C., and Vedaldi, A

Asano, Y. M., Rupprecht, C., and Vedaldi, A. A critical analysis of self-supervision, or what we can learn from a single image. arXiv preprint arXiv:1904.13132, 2019

work page arXiv 1904
[2]

D., and Buchwalter, W

Bachman, P., Hjelm, R. D., and Buchwalter, W. Learning representations by maximizing mutual information across views. In Advances in Neural Information Processing Systems, pp.\ 15509--15519, 2019

work page 2019
[3]

and Hinton, G

Becker, S. and Hinton, G. E. Self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355 0 (6356): 0 161--163, 1992

work page 1992
[4]

W., Alexander, M

Berg, T., Liu, J., Lee, S. W., Alexander, M. L., Jacobs, D. W., and Belhumeur, P. N. Birdsnap: Large-scale fine-grained visual categorization of birds. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 2019--2026. IEEE, 2014

work page 2019
[5]

Berthelot, D., Carlini, N., Goodfellow, I., Papernot, N., Oliver, A., and Raffel, C. A. Mixmatch: A holistic approach to semi-supervised learning. In Advances in Neural Information Processing Systems, pp.\ 5050--5060, 2019

work page 2019
[6]

Food-101--mining discriminative components with random forests

Bossard, L., Guillaumin, M., and Van Gool, L. Food-101--mining discriminative components with random forests. In European conference on computer vision, pp.\ 446--461. Springer, 2014

work page 2014
[7]

On sampling strategies for neural network-based collaborative filtering

Chen, T., Sun, Y., Shi, Y., and Hong, L. On sampling strategies for neural network-based collaborative filtering. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pp.\ 767--776, 2017

work page 2017
[8]

Self-supervised gans via auxiliary rotation loss

Chen, T., Zhai, X., Ritter, M., Lucic, M., and Houlsby, N. Self-supervised gans via auxiliary rotation loss. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp.\ 12154--12163, 2019

work page 2019
[9]

Describing textures in the wild

Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., and Vedaldi, A. Describing textures in the wild. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp.\ 3606--3613. IEEE, 2014

work page 2014
[10]

D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q

Cubuk, E. D., Zoph, B., Mane, D., Vasudevan, V., and Le, Q. V. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 113--123, 2019

work page 2019
[11]

Improved Regularization of Convolutional Neural Networks with Cutout

DeVries, T. and Taylor, G. W. Improved regularization of convolutional neural networks with cutout. arXiv preprint arXiv:1708.04552, 2017

work page internal anchor Pith review arXiv 2017
[12]

and Zisserman, A

Doersch, C. and Zisserman, A. Multi-task self-supervised visual learning. In Proceedings of the IEEE International Conference on Computer Vision, pp.\ 2051--2060, 2017

work page 2051
[13]

Doersch, C., Gupta, A., and Efros, A. A. Unsupervised visual representation learning by context prediction. In Proceedings of the IEEE International Conference on Computer Vision, pp.\ 1422--1430, 2015

work page 2015
[14]

and Simonyan, K

Donahue, J. and Simonyan, K. Large scale adversarial representation learning. In Advances in Neural Information Processing Systems, pp.\ 10541--10551, 2019

work page 2019
[15]

Decaf: A deep convolutional activation feature for generic visual recognition

Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. Decaf: A deep convolutional activation feature for generic visual recognition. In International Conference on Machine Learning, pp.\ 647--655, 2014

work page 2014
[16]

T., Riedmiller, M., and Brox, T

Dosovitskiy, A., Springenberg, J. T., Riedmiller, M., and Brox, T. Discriminative unsupervised feature learning with convolutional neural networks. In Advances in neural information processing systems, pp.\ 766--774, 2014

work page 2014
[17]

K., Winn, J., and Zisserman, A

Everingham, M., Van Gool, L., Williams, C. K., Winn, J., and Zisserman, A. The pascal visual object classes (voc) challenge. International Journal of Computer Vision, 88 0 (2): 0 303--338, 2010

work page 2010
[18]

Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories

Fei-Fei, L., Fergus, R., and Perona, P. Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshop on Generative-Model Based Vision, 2004

work page 2004
[19]

Unsupervised representation learning by predicting image rotations

Gidaris, S., Singh, P., and Komodakis, N. Unsupervised representation learning by predicting image rotations. arXiv preprint arXiv:1803.07728, 2018

work page arXiv 2018
[20]

Generative adversarial nets

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. Generative adversarial nets. In Advances in neural information processing systems, pp.\ 2672--2680, 2014

work page 2014
[21]

Accurate, Large Minibatch SGD: Training ImageNet in 1 Hour

Goyal, P., Doll \'a r, P., Girshick, R., Noordhuis, P., Wesolowski, L., Kyrola, A., Tulloch, A., Jia, Y., and He, K. Accurate, large minibatch sgd: Training imagenet in 1 hour. arXiv preprint arXiv:1706.02677, 2017

work page internal anchor Pith review Pith/arXiv arXiv 2017
[22]

Dimensionality reduction by learning an invariant mapping

Hadsell, R., Chopra, S., and LeCun, Y. Dimensionality reduction by learning an invariant mapping. In 2006 IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR'06), volume 2, pp.\ 1735--1742. IEEE, 2006

work page 2006
[23]

Deep residual learning for image recognition

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 770--778, 2016

work page 2016
[24]

Momentum contrast for unsupervised visual representation learning

He, K., Fan, H., Wu, Y., Xie, S., and Girshick, R. Momentum contrast for unsupervised visual representation learning. arXiv preprint arXiv:1911.05722, 2019

work page arXiv 1911
[25]

J., Razavi, A., Doersch, C., Eslami, S., and Oord, A

H \'e naff, O. J., Razavi, A., Doersch, C., Eslami, S., and Oord, A. v. d. Data-efficient image recognition with contrastive predictive coding. arXiv preprint arXiv:1905.09272, 2019

work page arXiv 1905
[26]

E., Osindero, S., and Teh, Y.-W

Hinton, G. E., Osindero, S., and Teh, Y.-W. A fast learning algorithm for deep belief nets. Neural computation, 18 0 (7): 0 1527--1554, 2006

work page 2006
[27]

Learning deep representations by mutual information estimation and maximization

Hjelm, R. D., Fedorov, A., Lavoie-Marchildon, S., Grewal, K., Bachman, P., Trischler, A., and Bengio, Y. Learning deep representations by mutual information estimation and maximization. arXiv preprint arXiv:1808.06670, 2018

work page Pith review arXiv 2018
[28]

Howard, A. G. Some improvements on deep convolutional neural network based image classification. arXiv preprint arXiv:1312.5402, 2013

work page arXiv 2013
[29]

Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015

work page internal anchor Pith review Pith/arXiv arXiv 2015
[30]

F., and Vedaldi, A

Ji, X., Henriques, J. F., and Vedaldi, A. Invariant information clustering for unsupervised image classification and segmentation. In Proceedings of the IEEE International Conference on Computer Vision, pp.\ 9865--9874, 2019

work page 2019
[31]

Kingma, D. P. and Welling, M. Auto-encoding variational bayes. arXiv preprint arXiv:1312.6114, 2013

work page internal anchor Pith review Pith/arXiv arXiv 2013
[32]

Revisiting self-supervised visual representation learning

Kolesnikov, A., Zhai, X., and Beyer, L. Revisiting self-supervised visual representation learning. In Proceedings of the IEEE conference on Computer Vision and Pattern Recognition, pp.\ 1920--1929, 2019

work page 1920
[33]

Kornblith, S., Shlens, J., and Le, Q. V. Do better ImageNet models transfer better? In Proceedings of the IEEE conference on computer vision and pattern recognition, pp.\ 2661--2671, 2019

work page 2019
[34]

Collecting a large-scale dataset of fine-grained cars

Krause, J., Deng, J., Stark, M., and Fei-Fei, L. Collecting a large-scale dataset of fine-grained cars. In Second Workshop on Fine-Grained Visual Categorization, 2013

work page 2013
[35]

and Hinton, G

Krizhevsky, A. and Hinton, G. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009. URL https://www.cs.toronto.edu/ kriz/learning-features-2009-TR.pdf

work page 2009
[36]

Krizhevsky, A., Sutskever, I., and Hinton, G. E. Imagenet classification with deep convolutional neural networks. In Advances in neural information processing systems, pp.\ 1097--1105, 2012

work page 2012
[37]

SGDR: Stochastic Gradient Descent with Warm Restarts

Loshchilov, I. and Hutter, F. Sgdr: Stochastic gradient descent with warm restarts. arXiv preprint arXiv:1608.03983, 2016

work page internal anchor Pith review Pith/arXiv arXiv 2016
[38]

Maaten, L. v. d. and Hinton, G. Visualizing data using t-sne. Journal of machine learning research, 9 0 (Nov): 0 2579--2605, 2008

work page 2008
[39]

Fine-grained visual classification of aircraft

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., and Vedaldi, A. Fine-grained visual classification of aircraft. Technical report, 2013

[40]

Mikolov, T., Chen, K., Corrado, G., and Dean, J. Efficient estimation of word representations in vector space. arXiv preprint arXiv:1301.3781, 2013

[41]

Misra, I. and van der Maaten, L. Self-supervised learning of pretext-invariant representations. arXiv preprint arXiv:1912.01991, 2019

[42]

Nilsback, M.-E. and Zisserman, A. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing, 2008. ICVGIP'08. Sixth Indian Conference on, pp. 722–729. IEEE, 2008

[43]

Noroozi, M. and Favaro, P. Unsupervised learning of visual representations by solving jigsaw puzzles. In European Conference on Computer Vision, pp. 69–84. Springer, 2016

[44]

Oord, A. v. d., Li, Y., and Vinyals, O. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018

[45]

Parkhi, O. M., Vedaldi, A., Zisserman, A., and Jawahar, C. Cats and dogs. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3498–3505. IEEE, 2012

[46]

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. Imagenet large scale visual recognition challenge. International journal of computer vision, 115(3): 211–252, 2015

[47]

Schroff, F., Kalenichenko, D., and Philbin, J. Facenet: A unified embedding for face recognition and clustering. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 815–823, 2015

[48]

Simonyan, K. and Zisserman, A. Very deep convolutional networks for large-scale image recognition. arXiv preprint arXiv:1409.1556, 2014

[49]

Sohn, K. Improved deep metric learning with multi-class n-pair loss objective. In Advances in neural information processing systems, pp. 1857–1865, 2016

[50]

Sohn, K., Berthelot, D., Li, C.-L., Zhang, Z., Carlini, N., Cubuk, E. D., Kurakin, A., Zhang, H., and Raffel, C. Fixmatch: Simplifying semi-supervised learning with consistency and confidence. arXiv preprint arXiv:2001.07685, 2020

[51]

Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., and Rabinovich, A. Going deeper with convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1–9, 2015

[52]

Tian, Y., Krishnan, D., and Isola, P. Contrastive multiview coding. arXiv preprint arXiv:1906.05849, 2019

[53]

Tschannen, M., Djolonga, J., Rubenstein, P. K., Gelly, S., and Lucic, M. On mutual information maximization for representation learning. arXiv preprint arXiv:1907.13625, 2019

[54]

Wu, Z., Xiong, Y., Yu, S. X., and Lin, D. Unsupervised feature learning via non-parametric instance discrimination. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3733–3742, 2018

[55]

Xiao, J., Hays, J., Ehinger, K. A., Oliva, A., and Torralba, A. Sun database: Large-scale scene recognition from abbey to zoo. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 3485–3492. IEEE, 2010

[56]

Xie, Q., Dai, Z., Hovy, E., Luong, M.-T., and Le, Q. V. Unsupervised data augmentation. arXiv preprint arXiv:1904.12848, 2019

[57]

Ye, M., Zhang, X., Yuen, P. C., and Chang, S.-F. Unsupervised embedding learning via invariant and spreading instance feature. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6210–6219, 2019

[58]

You, Y., Gitman, I., and Ginsburg, B. Large batch training of convolutional networks. arXiv preprint arXiv:1708.03888, 2017

[59]

Zhai, X., Oliver, A., Kolesnikov, A., and Beyer, L. S4l: Self-supervised semi-supervised learning. In The IEEE International Conference on Computer Vision (ICCV), October 2019

[60]

Zhang, R., Isola, P., and Efros, A. A. Colorful image colorization. In European conference on computer vision, pp. 649–666. Springer, 2016

[61]

Zhuang, C., Zhai, A. L., and Yamins, D. Local aggregation for unsupervised learning of visual embeddings. In Proceedings of the IEEE International Conference on Computer Vision, pp. 6002–6012, 2019
