pith. machine review for the scientific record.

arxiv: 2002.05709 · v3 · submitted 2020-02-13 · 💻 cs.LG · cs.CV · stat.ML


A Simple Framework for Contrastive Learning of Visual Representations


Pith reviewed 2026-05-13 18:28 UTC · model grok-4.3

classification: 💻 cs.LG · cs.CV · stat.ML
keywords: contrastive learning · self-supervised learning · visual representations · data augmentation · ImageNet · projection head · linear probe

The pith

A contrastive self-supervised framework learns ImageNet representations whose linear-probe accuracy matches a supervised ResNet-50.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces SimCLR as a streamlined contrastive learning approach for visual representations that dispenses with memory banks or specialized networks. It systematically tests framework components and finds that particular data-augmentation combinations, a learnable nonlinear projection head, and larger batch sizes together drive strong performance. When a linear classifier is trained on top of the resulting features, top-1 accuracy reaches 76.5 percent on ImageNet, a 7 percent relative gain over prior self-supervised methods and on par with fully supervised ResNet-50. Readers should care because the work shows that simple, scalable contrastive objectives can close most of the gap with supervised learning while using no labels for the representation stage.

Core claim

By using two randomly augmented views of each image as a positive pair and all other images in the batch as negatives, SimCLR trains an encoder followed by a nonlinear projection head under a contrastive loss; the resulting representations, when evaluated with a linear classifier, reach 76.5 percent top-1 accuracy on ImageNet and 85.8 percent top-5 accuracy when fine-tuned on only 1 percent of the labels.

What carries the argument

The SimCLR contrastive prediction task, which treats two augmented views of the same image as positives and uses a learnable nonlinear projection head to map representations into the space where the loss is applied.
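
The contrastive prediction task described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `nt_xent_loss`, the row layout (the two views of image k sit at rows k and k + N), and the default temperature of 0.5 are all assumptions made for the example.

```python
import numpy as np

def nt_xent_loss(z, temperature=0.5):
    """Normalized-temperature cross-entropy over a batch of 2N projections.

    Assumed layout (illustrative): rows k and k + N of `z` are the two
    augmented views of image k, so every row has exactly one positive.
    """
    z = np.asarray(z, dtype=np.float64)
    two_n = z.shape[0]
    n = two_n // 2
    # L2-normalize so that dot products are cosine similarities.
    z = z / np.linalg.norm(z, axis=1, keepdims=True)
    logits = z @ z.T / temperature
    # A view is never contrasted against itself.
    np.fill_diagonal(logits, -np.inf)
    # Index of the positive for each row: k <-> k + N.
    positives = np.concatenate([np.arange(n, two_n), np.arange(0, n)])
    # Numerically stable log-sum-exp over each row.
    row_max = logits.max(axis=1, keepdims=True)
    log_norm = row_max[:, 0] + np.log(np.exp(logits - row_max).sum(axis=1))
    log_prob_pos = logits[np.arange(two_n), positives] - log_norm
    return -log_prob_pos.mean()
```

Every other row in the batch acts as a negative, which is why no memory bank is needed in this formulation: the negatives come for free from the batch itself.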

If this is right

  • Contrastive learning benefits more from very large batch sizes and longer training than supervised classification does.
  • Effective predictive tasks arise mainly from composing multiple data augmentations rather than from any single transform.
  • Inserting a nonlinear projection head between the representation and the contrastive loss measurably improves downstream linear-probe accuracy.
  • The same representations support strong semi-supervised fine-tuning when only 1 percent of ImageNet labels are available.
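
The data flow behind these points (two augmented views, an encoder, then a nonlinear head) can be sketched as follows. Everything here is a toy stand-in chosen for illustration: the `augment` policy is far simpler than the paper's, `encoder` is a single random linear layer rather than a ResNet, and all shapes and weights are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image, rng):
    """Toy composition: random crop, random horizontal flip, brightness scale.

    These are simplified stand-ins, not the paper's augmentation policy.
    """
    size = image.shape[0]
    crop = size - 4
    top, left = rng.integers(0, size - crop + 1, size=2)
    view = image[top:top + crop, left:left + crop]
    if rng.random() < 0.5:
        view = view[:, ::-1]
    return view * rng.uniform(0.6, 1.4)

def encoder(view, w_enc):
    """Hypothetical stand-in for the encoder: flatten, linear map, ReLU."""
    return np.maximum(view.reshape(-1) @ w_enc, 0.0)

def projection_head(h, w1, w2):
    """Nonlinear head: one hidden layer with ReLU, then a linear map."""
    return np.maximum(h @ w1, 0.0) @ w2

image = rng.random((32, 32))                     # toy grayscale image
w_enc = rng.normal(size=(28 * 28, 64))
w1 = rng.normal(size=(64, 64))
w2 = rng.normal(size=(64, 16))

views = [augment(image, rng) for _ in range(2)]  # the positive pair
h = [encoder(v, w_enc) for v in views]           # representation given to the linear probe
z = [projection_head(x, w1, w2) for x in h]      # vector given to the contrastive loss
```

The point of the sketch is the split between `h` and `z`: the loss only ever sees `z`, while downstream evaluation uses `h`.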

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same augmentation-plus-projection recipe could be tested on video or audio to see whether the same scaling laws hold outside static images.
  • If the method continues to improve with even larger batches and longer training, self-supervised pretraining might become the default first stage for most vision pipelines.
  • The framework's simplicity suggests it can serve as a reproducible baseline for studying how much further contrastive objectives can be pushed without architectural novelty.

Load-bearing premise

That the particular choices of data-augmentation composition and the nonlinear projection head are the main sources of the observed gains rather than interactions with untested optimizer or architecture details.

What would settle it

Retraining the identical encoder and loss with either a single augmentation policy or a linear projection head and observing that top-1 linear-probe accuracy falls below 70 percent on ImageNet would falsify the claimed importance of those two components.

Original abstract

This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
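
The "7% relative improvement" in the abstract can be unpacked with a line of arithmetic. The sketch below assumes that "relative" means (new - old) / old; the variable names are illustrative, and the implied baseline is derived from the abstract's numbers rather than quoted from the paper.

```python
simclr_top1 = 76.5      # reported linear-probe top-1 accuracy (%)
relative_gain = 0.07    # "7% relative improvement" from the abstract

# Assumption: relative gain = (new - old) / old.
implied_previous = simclr_top1 / (1 + relative_gain)
absolute_gap = simclr_top1 - implied_previous

print(round(implied_previous, 1), round(absolute_gap, 1))  # prints 71.5 5.0
```

Under that reading, the headline gain corresponds to roughly five percentage points of top-1 accuracy.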

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces SimCLR, a simplified contrastive self-supervised learning framework for visual representations. It removes the need for specialized architectures or memory banks, relying instead on standard ResNet backbones, a composition of data augmentations, a learnable nonlinear projection head, and the NT-Xent contrastive loss. Systematic ablations isolate the contributions of augmentation composition, the projection head, and large batch sizes. The central empirical result is that a linear classifier trained on the learned representations achieves 76.5% top-1 accuracy on ImageNet, a 7% relative improvement over prior self-supervised methods and matching the performance of a supervised ResNet-50; semi-supervised fine-tuning with 1% labels reaches 85.8% top-5 accuracy.

Significance. If the results hold, this work is significant because it demonstrates that a simple contrastive recipe can match supervised baselines on a large-scale benchmark while supplying ablation evidence that clarifies the roles of data augmentation and the projection head. These insights have shaped subsequent representation-learning research. The manuscript supplies exact training protocols and ablation tables, supporting reproducibility of the headline numbers.

major comments (1)
  1. [Experiments] Experiments section: batch size and temperature are tuned on the ImageNet validation set that is also used to report the final 76.5% top-1 accuracy. This creates a risk that the headline number reflects hyperparameter overfitting to the evaluation distribution rather than a robust improvement; a separate tuning split or cross-validation protocol would be needed to confirm the claim.
minor comments (2)
  1. [Figure 2] The caption of the main framework figure would be clearer if it explicitly distinguished the representation vector h from the projected vector z and indicated where the contrastive loss is computed.
  2. [§3.2] §3.2: while the NT-Xent loss is standard, writing its explicit normalized-temperature cross-entropy formula would improve self-contained readability.
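
For reference, the formula requested in minor comment 2 is usually written as below. The notation is an assumption of this sketch: a batch of $N$ images yields $2N$ augmented views, $z_i$ is the projected vector of view $i$, $\tau$ is the temperature, and $(i, j)$ is a positive pair.

```latex
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert},
\qquad
\ell_{i,j} = -\log
  \frac{\exp\!\big(\mathrm{sim}(z_i, z_j)/\tau\big)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\!\big(\mathrm{sim}(z_i, z_k)/\tau\big)},
\qquad
\mathcal{L} = \frac{1}{2N} \sum_{k=1}^{N} \big[\ell_{2k-1,\,2k} + \ell_{2k,\,2k-1}\big]
```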

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive review and the recommendation for minor revision. We address the single major comment below.

Point-by-point responses
  1. Referee: [Experiments] Experiments section: batch size and temperature are tuned on the ImageNet validation set that is also used to report the final 76.5% top-1 accuracy. This creates a risk that the headline number reflects hyperparameter overfitting to the evaluation distribution rather than a robust improvement; a separate tuning split or cross-validation protocol would be needed to confirm the claim.

    Authors: We appreciate the referee's observation. Hyperparameters including batch size and temperature were indeed selected using the ImageNet validation set, following the standard protocol for linear evaluation on this benchmark (where test labels remain unavailable). However, the manuscript's primary claims rest on systematic ablations that isolate the effects of augmentation composition and the nonlinear projection head; these trends hold across wide ranges of batch sizes (256 to 8192) and temperatures (0.05 to 0.5) without requiring the final reported configuration. The headline 76.5% result is also consistent with the supervised ResNet-50 baseline under identical evaluation, which itself uses the same validation set. In the revised manuscript we will add an explicit paragraph in Section 4 clarifying the hyperparameter selection process, noting the common practice in the field, and stating that the core architectural and augmentation insights were validated independently of the final hyperparameter choice.

    Revision: yes
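
The remedy the referee proposes, a tuning split disjoint from the set used for reporting, might look like the sketch below. The helper name `make_tuning_split`, the 1% fraction, and the toy dataset size are hypothetical; neither the paper nor the rebuttal describes such a split.

```python
import numpy as np

def make_tuning_split(num_train, tuning_fraction=0.01, seed=0):
    """Carve a held-out tuning subset out of the training indices.

    Hyperparameters such as temperature and batch size would be selected
    on the tuning indices; the official validation set is then used only
    once, to report the final number.
    """
    rng = np.random.default_rng(seed)
    perm = rng.permutation(num_train)
    num_tune = int(round(num_train * tuning_fraction))
    return perm[num_tune:], perm[:num_tune]

# Toy size for illustration, not the real ImageNet training-set size.
train_idx, tune_idx = make_tuning_split(num_train=10_000)
```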

Circularity Check

0 steps flagged

No significant circularity detected

full rationale

The manuscript is an empirical study introducing the SimCLR framework for contrastive self-supervised learning. All reported performance figures (e.g., 76.5% linear top-1 on ImageNet validation) are obtained by training the described recipe and measuring accuracy on held-out data against external baselines. No equations, predictions, or first-principles derivations appear that reduce to fitted inputs or self-citations by construction. Ablation tables isolate the effects of augmentations, projection head, and batch size without circular re-use of the target metric. Self-citations to prior contrastive work are not load-bearing for the central empirical claims.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The framework rests on standard contrastive-learning assumptions plus a small number of fitted hyperparameters; no new physical entities are postulated.

free parameters (2)
  • temperature parameter in NT-Xent loss
    Controls the sharpness of the contrastive distribution; chosen via validation performance.
  • batch size
    Shown to improve performance when increased; selected as a large value (thousands) based on experiments.
axioms (1)
  • domain assumption: Maximizing agreement between positive pairs while contrasting against negatives yields useful representations.
    Core premise of the contrastive prediction task, stated in the framework description.
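
The ledger's remark that the temperature controls the sharpness of the contrastive distribution can be checked numerically. In this sketch the similarity values are made up for illustration, and the two temperatures are the ends of the 0.05 to 0.5 range mentioned in the simulated rebuttal.

```python
import numpy as np

def softmax_with_temperature(similarities, temperature):
    """Softmax over similarities scaled by 1 / temperature."""
    logits = np.asarray(similarities, dtype=np.float64) / temperature
    logits -= logits.max()        # stabilize the exponentials
    weights = np.exp(logits)
    return weights / weights.sum()

sims = [0.9, 0.5, 0.1, -0.3]      # hypothetical similarities to one anchor
for tau in (0.05, 0.5):
    print(tau, np.round(softmax_with_temperature(sims, tau), 3))
```

A small temperature puts nearly all probability mass on the most similar candidate, while a larger one spreads mass across the negatives, which changes how strongly each negative contributes to the gradient.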




Forward citations

Cited by 25 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning

    cs.LG 2026-05 unverdicted novelty 7.0

    SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude ...

  2. Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors

    cs.LG 2026-04 unverdicted novelty 7.0

    NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.

  3. Self-Directed Task Identification

    cs.LG 2026-04 unverdicted novelty 7.0

    SDTI lets models identify the correct target variable in datasets in a zero-shot setting using standard neural networks, beating baselines by 14% F1 on synthetic benchmarks.

  4. BEiT: BERT Pre-Training of Image Transformers

    cs.CV 2021-06 conditional novelty 7.0

    BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.

  5. Mastering Atari with Discrete World Models

    cs.LG 2020-10 accept novelty 7.0

    DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.

  6. Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization

    q-bio.NC 2026-05 unverdicted novelty 6.0

    Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.

  7. CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking

    cs.CV 2026-05 unverdicted novelty 6.0

    CalibFree enables calibration-free multi-camera tracking via self-supervised feature separation through single-view distillation and cross-view reconstruction, reporting 3% higher accuracy and 7.5% better F1 on tested...

  8. An Interpretable and Scalable Framework for Evaluating Large Language Models

    stat.ML 2026-05 unverdicted novelty 6.0

    A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.

  9. Velox: Learning Representations of 4D Geometry and Appearance

    cs.CV 2026-05 unverdicted novelty 6.0

    Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...

  10. ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching

    cs.CV 2026-04 unverdicted novelty 6.0

    ShapeY is a benchmark dataset and nearest-neighbor protocol that measures shape-based recognition in vision models, revealing that even state-of-the-art networks fail to generalize consistently across 3D viewpoints an...

  11. StarCLR: Contrastive Learning Representation for Astronomical Light Curves

    astro-ph.SR 2026-04 conditional novelty 6.0

    StarCLR pretrains on TESS light curves via contrastive learning on overlapping subsequences and improves variable star classification F1 scores over scratch-trained models when fine-tuned on TESS, ZTF, and Gaia.

  12. Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models

    q-bio.NC 2026-04 unverdicted novelty 6.0

    Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.

  13. Self-supervised Pretraining of Cell Segmentation Models

    cs.CV 2026-04 unverdicted novelty 6.0

    DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.

  14. Zero-shot World Models Are Developmentally Efficient Learners

    cs.AI 2026-04 unverdicted novelty 6.0

    A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.

  15. Masked Contrastive Pre-Training Improves Music Audio Key Detection

    cs.SD 2026-04 unverdicted novelty 6.0

    Masked contrastive pre-training on music audio yields representations that achieve SOTA key detection performance in the supervised setting without sophisticated augmentations.

  16. Revisiting Feature Prediction for Learning Visual Representations from Video

    cs.CV 2024-02 conditional novelty 6.0

    V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.

  17. Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model

    cs.LG 2026-05 unverdicted novelty 5.0

    A contrastive-learning ECG foundation model with multitask heads predicts post-MI outcomes better than training from scratch (AUC 0.794 vs 0.608).

  18. Information theoretic underpinning of self-supervised learning by clustering

    cs.LG 2026-05 unverdicted novelty 5.0

    SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.

  19. Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling

    cs.LG 2026-04 unverdicted novelty 5.0

    Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.

  20. OC-Distill: Ontology-aware Contrastive Learning with Cross-Modal Distillation for ICU Risk Prediction

    cs.LG 2026-04 unverdicted novelty 5.0

    OC-Distill combines ontology-aware contrastive pretraining with cross-modal distillation to improve ICU risk prediction performance and label efficiency while using only vital signs at inference.

  21. Improved Baselines with Momentum Contrastive Learning

    cs.CV 2020-03 accept novelty 5.0

    Adding an MLP projection head and enhanced augmentations to MoCo produces stronger unsupervised vision baselines that beat SimCLR while using smaller batches.

  22. Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection

    eess.AS 2026-04 unverdicted novelty 4.0

    Cosine similarity in SupCon with a delayed negative queue on wav2vec2 XLS-R yields the lowest equal error rates for deepfake audio detection on in-the-wild and pooled evaluations.

  23. LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning

    cs.SE 2026-04 unverdicted novelty 4.0

    LLMSniffer improves detection of LLM-generated code on GPTSniffer and Whodunit benchmarks by fine-tuning GraphCodeBERT via two-stage supervised contrastive learning plus preprocessing and MLP classification.

  24. Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization

    cs.CV 2026-04 unverdicted novelty 4.0

    DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.

  25. LLMs Struggle with Abstract Meaning Comprehension More Than Expected

    cs.CL 2026-04 unverdicted novelty 3.0

    LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.
