A Simple Framework for Contrastive Learning of Visual Representations
Pith reviewed 2026-05-13 18:28 UTC · model grok-4.3
The pith
A contrastive self-supervised framework learns ImageNet representations on which a linear classifier matches a supervised ResNet-50.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
SimCLR treats two randomly augmented views of each image as a positive pair and all other images in the batch as negatives, and trains an encoder followed by a nonlinear projection head under a contrastive loss. A linear classifier on the resulting representations reaches 76.5 percent top-1 accuracy on ImageNet, and fine-tuning on only 1 percent of the labels reaches 85.8 percent top-5 accuracy.
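A minimal sketch of the contrastive objective described above, in plain Python with no dependencies. The pairing convention (the two views of image k stored at indices 2k and 2k+1), the function names, and the default temperature of 0.5 are illustrative assumptions, not the authors' implementation.

```python
import math

def cosine_similarity(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def nt_xent_loss(z, temperature=0.5):
    """Mean NT-Xent loss over a batch of 2N projected vectors.

    Assumed layout: z[2k] and z[2k + 1] are the two views of image k.
    """
    n = len(z)
    assert n >= 2 and n % 2 == 0, "need an even number of views"
    total = 0.0
    for i in range(n):
        j = i + 1 if i % 2 == 0 else i - 1  # index of the positive partner
        # Similarities to every other view; the view itself is excluded.
        logits = [cosine_similarity(z[i], z[k]) / temperature
                  for k in range(n) if k != i]
        positive = cosine_similarity(z[i], z[j]) / temperature
        # Log-sum-exp with the maximum subtracted for numerical stability.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denominator - positive
    return total / n

# Toy batches: positives that agree should give a lower loss than positives that do not.
aligned = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]]
mixed = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0], [0.0, 1.0]]
loss_aligned = nt_xent_loss(aligned)
loss_mixed = nt_xent_loss(mixed)
```

Every other view in the batch acts as a negative here, which is why the loss needs no memory bank: the batch itself supplies the contrast set.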
What carries the argument
The SimCLR contrastive prediction task, which treats two augmented views of the same image as positives and uses a learnable nonlinear projection head to map representations into the space where the loss is applied.
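To make the role of the projection head concrete, here is a small sketch of a two-layer head of the form z = W2 · ReLU(W1 · h), where h is the encoder representation and z is the vector the loss sees. The weights and dimensions are hypothetical toy values chosen so the arithmetic is easy to follow.

```python
def relu(v):
    """Element-wise max(0, x)."""
    return [max(0.0, x) for x in v]

def matvec(w, v):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in w]

def projection_head(h, w1, w2):
    """Nonlinear map from representation h to projection z."""
    return matvec(w2, relu(matvec(w1, h)))

# Hypothetical toy weights: 3-d representation -> 3-d hidden layer -> 2-d projection.
w1 = [[1.0, 0.0, -1.0],
      [0.0, 1.0, 0.0],
      [-1.0, 0.0, 1.0]]
w2 = [[1.0, 1.0, 0.0],
      [0.0, 1.0, 1.0]]

h = [2.0, 1.0, 0.5]             # what a downstream linear classifier would use
z = projection_head(h, w1, w2)  # what the contrastive loss would use
```

The design point is the separation of roles: the loss is applied to z, while h is the vector kept for downstream evaluation.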
If this is right
- Contrastive learning benefits more from very large batch sizes and longer training than supervised classification does.
- Effective predictive tasks arise mainly from composing multiple data augmentations rather than from any single transform.
- Inserting a nonlinear projection head between the representation and the contrastive loss measurably improves downstream linear-probe accuracy.
- The same representations support strong semi-supervised fine-tuning when only 1 percent of ImageNet labels are available.
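The point about composing augmentations can be sketched as follows. The two transforms used here (a random crop and a brightness shift on a tiny grayscale image stored as nested lists) are simplified stand-ins chosen for illustration; the helper names and parameter values are assumptions, not the paper's augmentation policy.

```python
import random

def random_crop(image, size, rng):
    """Cut a random size x size window out of a 2-D list image."""
    top = rng.randrange(len(image) - size + 1)
    left = rng.randrange(len(image[0]) - size + 1)
    return [row[left:left + size] for row in image[top:top + size]]

def brightness_jitter(image, strength, rng):
    """Shift every pixel by one random offset, clipped to [0, 1]."""
    delta = rng.uniform(-strength, strength)
    return [[min(1.0, max(0.0, p + delta)) for p in row] for row in image]

def compose(*transforms):
    """Chain transforms left to right into a single augmentation."""
    def apply(image, rng):
        for transform in transforms:
            image = transform(image, rng)
        return image
    return apply

# Composition of two transforms; either one alone defines an easier task.
augment = compose(
    lambda img, rng: random_crop(img, 4, rng),
    lambda img, rng: brightness_jitter(img, 0.2, rng),
)

def positive_pair(image, rng):
    """Two independently augmented views of the same image."""
    return augment(image, rng), augment(image, rng)

# 8x8 gradient image with pixel values in [0, 1].
image = [[(r * 8 + c) / 63 for c in range(8)] for r in range(8)]
view_a, view_b = positive_pair(image, random.Random(0))
```

Because each call to `augment` draws its own crop position and brightness offset, the two views generally differ in both geometry and intensity, so matching them requires more than comparing raw pixel statistics.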
Where Pith is reading between the lines
- The augmentation-plus-projection recipe could be tested on video or audio to see whether the batch-size and training-length trends hold outside static images.
- If the method continues to improve with even larger batches and longer training, self-supervised pretraining might become the default first stage for most vision pipelines.
- The framework's simplicity suggests it can serve as a reproducible baseline for studying how much further contrastive objectives can be pushed without architectural novelty.
Load-bearing premise
That the particular choices of data-augmentation composition and the nonlinear projection head are the main sources of the observed gains rather than interactions with untested optimizer or architecture details.
What would settle it
Retrain the identical encoder and loss with either a single augmentation policy or a linear projection head. If top-1 linear-probe accuracy on ImageNet then falls below 70 percent, the claimed importance of those two components is confirmed; if it stays close to the full recipe's 76.5 percent, the claim is falsified.
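For readers unfamiliar with the linear-probe metric used in this test, the sketch below fits a softmax classifier on fixed feature vectors and reports top-1 accuracy. The toy features, the absence of a bias term, and the learning rate and step count are assumptions for illustration; a real evaluation would use frozen encoder outputs and held-out images.

```python
import math

def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(l - m) for l in logits]
    s = sum(exps)
    return [e / s for e in exps]

def scores(weights, x):
    """One logit per class: dot product of each weight row with x."""
    return [sum(w * v for w, v in zip(row, x)) for row in weights]

def train_linear_probe(features, labels, num_classes, lr=0.5, steps=200):
    """Fit softmax regression by gradient descent; features stay fixed."""
    dim = len(features[0])
    weights = [[0.0] * dim for _ in range(num_classes)]
    for _ in range(steps):
        grads = [[0.0] * dim for _ in range(num_classes)]
        for x, y in zip(features, labels):
            probs = softmax(scores(weights, x))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                for d in range(dim):
                    grads[c][d] += err * x[d] / len(features)
        for c in range(num_classes):
            for d in range(dim):
                weights[c][d] -= lr * grads[c][d]
    return weights

def top1_accuracy(weights, features, labels):
    """Fraction of examples whose highest-scoring class is the label."""
    correct = 0
    for x, y in zip(features, labels):
        logits = scores(weights, x)
        correct += logits.index(max(logits)) == y
    return correct / len(labels)

# Toy stand-ins for frozen representations h and their class labels.
frozen_h = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]
probe = train_linear_probe(frozen_h, labels, num_classes=2)
accuracy = top1_accuracy(probe, frozen_h, labels)
```

Only the classifier weights are updated, so the accuracy measures how linearly separable the representations already are.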
Original abstract
This paper presents SimCLR: a simple framework for contrastive learning of visual representations. We simplify recently proposed contrastive self-supervised learning algorithms without requiring specialized architectures or a memory bank. In order to understand what enables the contrastive prediction tasks to learn useful representations, we systematically study the major components of our framework. We show that (1) composition of data augmentations plays a critical role in defining effective predictive tasks, (2) introducing a learnable nonlinear transformation between the representation and the contrastive loss substantially improves the quality of the learned representations, and (3) contrastive learning benefits from larger batch sizes and more training steps compared to supervised learning. By combining these findings, we are able to considerably outperform previous methods for self-supervised and semi-supervised learning on ImageNet. A linear classifier trained on self-supervised representations learned by SimCLR achieves 76.5% top-1 accuracy, which is a 7% relative improvement over previous state-of-the-art, matching the performance of a supervised ResNet-50. When fine-tuned on only 1% of the labels, we achieve 85.8% top-5 accuracy, outperforming AlexNet with 100X fewer labels.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces SimCLR, a simplified contrastive self-supervised learning framework for visual representations. It removes the need for specialized architectures or memory banks, relying instead on standard ResNet backbones, a composition of data augmentations, a learnable nonlinear projection head, and the NT-Xent contrastive loss. Systematic ablations isolate the contributions of augmentation composition, the projection head, and large batch sizes. The central empirical result is that a linear classifier trained on the learned representations achieves 76.5% top-1 accuracy on ImageNet, a 7% relative improvement over prior self-supervised methods and matching the performance of a supervised ResNet-50; semi-supervised fine-tuning with 1% labels reaches 85.8% top-5 accuracy.
Significance. If the results hold, this work is significant because it demonstrates that a simple contrastive recipe can match supervised baselines on a large-scale benchmark while supplying ablation evidence that clarifies the roles of data augmentation and the projection head. These insights have shaped subsequent representation-learning research. The manuscript supplies exact training protocols and ablation tables, supporting reproducibility of the headline numbers.
major comments (1)
- [Experiments] Experiments section: batch size and temperature are tuned on the ImageNet validation set that is also used to report the final 76.5% top-1 accuracy. This creates a risk that the headline number reflects hyperparameter overfitting to the evaluation distribution rather than a robust improvement; a separate tuning split or cross-validation protocol would be needed to confirm the claim.
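One way to act on this comment is to carve a tuning subset out of the training data and select hyperparameters there, leaving the validation set untouched until the final report. The sketch below shows the idea; the function names, the 10 percent fraction, and the scoring callback are hypothetical, not a protocol from the manuscript.

```python
import random

def split_for_tuning(train_ids, tuning_fraction=0.1, seed=0):
    """Shuffle the training ids and hold out a fraction for tuning."""
    ids = list(train_ids)
    random.Random(seed).shuffle(ids)
    n_tune = max(1, int(len(ids) * tuning_fraction))
    return ids[n_tune:], ids[:n_tune]  # (remaining training ids, tuning ids)

def select_config(configs, score_on_tuning_split):
    """Pick the configuration with the best score on the tuning split."""
    return max(configs, key=score_on_tuning_split)

train_ids, tuning_ids = split_for_tuning(range(100))

configs = [
    {"batch_size": 256, "temperature": 0.5},
    {"batch_size": 4096, "temperature": 0.1},
]

# Placeholder scorer (assumption): a real one would train and evaluate each
# configuration using only tuning_ids, never the reported validation set.
def toy_score(config):
    return -abs(config["temperature"] - 0.1)

best = select_config(configs, toy_score)
```

The reported number is then computed once, on the validation set, with the selected configuration.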
minor comments (2)
- [Figure 2] The caption of the main framework figure would be clearer if it explicitly distinguished the representation vector h from the projected vector z and indicated where the contrastive loss is computed.
- [§3.2] §3.2: while the NT-Xent loss is standard, writing its explicit normalized-temperature cross-entropy formula would improve self-contained readability.
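For reference, the formula this comment asks for can be written as below, for a batch of N images yielding 2N augmented views. The notation follows the passage quoted elsewhere on this page (z for projected vectors, τ for temperature); the explicit indicator and summation range are spelled out here for completeness.

```latex
\ell_{i,j} = -\log
  \frac{\exp\left(\mathrm{sim}(z_i, z_j)/\tau\right)}
       {\sum_{k=1}^{2N} \mathbb{1}_{[k \neq i]} \exp\left(\mathrm{sim}(z_i, z_k)/\tau\right)},
\qquad
\mathrm{sim}(u, v) = \frac{u^{\top} v}{\lVert u \rVert \, \lVert v \rVert}
```

Here (i, j) indexes a positive pair, and the indicator removes the self-similarity term from the denominator.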
Simulated Author's Rebuttal
We thank the referee for the constructive review and the recommendation for minor revision. We address the single major comment below.
point-by-point responses
- Referee: [Experiments] Experiments section: batch size and temperature are tuned on the ImageNet validation set that is also used to report the final 76.5% top-1 accuracy. This creates a risk that the headline number reflects hyperparameter overfitting to the evaluation distribution rather than a robust improvement; a separate tuning split or cross-validation protocol would be needed to confirm the claim.
  Authors: We appreciate the referee's observation. Hyperparameters including batch size and temperature were indeed selected using the ImageNet validation set, following the standard protocol for linear evaluation on this benchmark (where test labels remain unavailable). However, the manuscript's primary claims rest on systematic ablations that isolate the effects of augmentation composition and the nonlinear projection head; these trends hold across wide ranges of batch sizes (256 to 8192) and temperatures (0.05 to 0.5) without requiring the final reported configuration. The headline 76.5% result is also consistent with the supervised ResNet-50 baseline under identical evaluation, which itself uses the same validation set. In the revised manuscript we will add an explicit paragraph in Section 4 clarifying the hyperparameter selection process, noting the common practice in the field, and stating that the core architectural and augmentation insights were validated independently of the final hyperparameter choice.
  Revision: yes
Circularity Check
No significant circularity detected
full rationale
The manuscript is an empirical study introducing the SimCLR framework for contrastive self-supervised learning. All reported performance figures (e.g., 76.5% linear top-1 on ImageNet validation) are obtained by training the described recipe and measuring accuracy on held-out data against external baselines. No equations, predictions, or first-principles derivations appear that reduce to fitted inputs or self-citations by construction. Ablation tables isolate the effects of augmentations, projection head, and batch size without circular re-use of the target metric. Self-citations to prior contrastive work are not load-bearing for the central empirical claims.
Axiom & Free-Parameter Ledger
free parameters (2)
- temperature parameter in NT-Xent loss
- batch size
axioms (1)
- domain assumption: Maximizing agreement between positive pairs while contrasting against negatives yields useful representations.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean: washburn_uniqueness_aczel
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: ℓ_{i,j} = −log [exp(sim(z_i,z_j)/τ) / ∑_{k≠i} exp(sim(z_i,z_k)/τ)] with sim = cosine similarity
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Relation between the paper passage and the cited Recognition theorem: unclear
  Passage: linear classifier on frozen ResNet-50 representations yields 76.5% top-1
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 25 Pith papers
- SMA: Submodular Modality Aligner For Data Efficient Multimodal Learning
  SMA uses a submodular mutual information objective on data sets to deliver competitive zero-shot classification and retrieval performance on CLIP benchmarks with only tens of thousands of samples, orders of magnitude ...
- Learning Posterior Predictive Distributions for Node Classification from Synthetic Graph Priors
  NodePFN pre-trains on synthetic graphs with controllable homophily and causal feature-label models to achieve 71.27 average accuracy on 23 node classification benchmarks without graph-specific training.
- Self-Directed Task Identification
  SDTI lets models identify the correct target variable in datasets in a zero-shot setting using standard neural networks, beating baselines by 14% F1 on synthetic benchmarks.
- BEiT: BERT Pre-Training of Image Transformers
  BEiT pre-trains vision transformers via masked image modeling on visual tokens and reaches 83.2% ImageNet top-1 accuracy for the base model and 86.3% for the large model using only ImageNet-1K data.
- Mastering Atari with Discrete World Models
  DreamerV2 reaches human-level performance on 55 Atari games by learning behaviors inside a separately trained discrete-latent world model.
- Self-organized MT Direction Maps Emerge from Spatiotemporal Contrastive Optimization
  Direction maps and pinwheel structures in MT emerge spontaneously when a spatiotemporal deep network is trained on videos with contrastive self-supervised learning and spatial regularization.
- CalibFree: Self-Supervised View Feature Separation for Calibration-Free Multi-Camera Multi-Object Tracking
  CalibFree enables calibration-free multi-camera tracking via self-supervised feature separation through single-view distillation and cross-view reconstruction, reporting 3% higher accuracy and 7.5% better F1 on tested...
- An Interpretable and Scalable Framework for Evaluating Large Language Models
  A majorization-minimization framework turns IRT into scalable matrix factorization subproblems for LLM evaluation, delivering orders-of-magnitude speedups with identifiability guarantees.
- Velox: Learning Representations of 4D Geometry and Appearance
  Velox compresses dynamic point clouds into latent tokens that support geometry via 4D surface modeling and appearance via 3D Gaussians, showing strong results on video-to-4D generation, tracking, and image-to-4D cloth...
- ShapeY: A Principled Framework for Measuring Shape Recognition Capacity via Nearest-Neighbor Matching
  ShapeY is a benchmark dataset and nearest-neighbor protocol that measures shape-based recognition in vision models, revealing that even state-of-the-art networks fail to generalize consistently across 3D viewpoints an...
- StarCLR: Contrastive Learning Representation for Astronomical Light Curves
  StarCLR pretrains on TESS light curves via contrastive learning on overlapping subsequences and improves variable star classification F1 scores over scratch-trained models when fine-tuned on TESS, ZTF, and Gaia.
- Only Brains Align with Brains: Cross-Region Alignment Patterns Expose Limits of Normative Models
  Alignment pattern analysis reveals that models aligned to individual brain ROIs do not reproduce the stable cross-region alignment profiles observed across human subjects.
- Self-supervised Pretraining of Cell Segmentation Models
  DINOCell achieves a SEG score of 0.784 on LIVECell by self-supervised domain adaptation of DINOv2, improving 10.42% over SAM-based models and showing strong zero-shot transfer.
- Zero-shot World Models Are Developmentally Efficient Learners
  A zero-shot visual world model trained on one child's experience achieves broad competence on physical understanding benchmarks while matching developmental behavioral patterns.
- Masked Contrastive Pre-Training Improves Music Audio Key Detection
  Masked contrastive pre-training on music audio yields representations that achieve SOTA key detection performance in the supervised setting without sophisticated augmentations.
- Revisiting Feature Prediction for Learning Visual Representations from Video
  V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Dynamical Predictive Modelling of Cardiovascular Disease Progression Post-Myocardial Infarction via ECG-Trained Artificial Intelligence Model
  A contrastive-learning ECG foundation model with multitask heads predicts post-MI outcomes better than training from scratch (AUC 0.794 vs 0.608).
- Information theoretic underpinning of self-supervised learning by clustering
  SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
- Handling and Interpreting Missing Modalities in Patient Clinical Trajectories via Autoregressive Sequence Modeling
  Autoregressive transformer modeling with missingness-aware contrastive pre-training outperforms baselines on MIMIC-IV and eICU benchmarks and mitigates divergent behavior from removed modalities in clinical trajectories.
- OC-Distill: Ontology-aware Contrastive Learning with Cross-Modal Distillation for ICU Risk Prediction
  OC-Distill combines ontology-aware contrastive pretraining with cross-modal distillation to improve ICU risk prediction performance and label efficiency while using only vital signs at inference.
- Improved Baselines with Momentum Contrastive Learning
  Adding an MLP projection head and enhanced augmentations to MoCo produces stronger unsupervised vision baselines that beat SimCLR while using smaller batches.
- Similarity Choice and Negative Scaling in Supervised Contrastive Learning for Deepfake Audio Detection
  Cosine similarity in SupCon with a delayed negative queue on wav2vec2 XLS-R yields the lowest equal error rates for deepfake audio detection on in-the-wild and pooled evaluations.
- LLMSniffer: Detecting LLM-Generated Code via GraphCodeBERT and Supervised Contrastive Learning
  LLMSniffer improves detection of LLM-generated code on GPTSniffer and Whodunit benchmarks by fine-tuning GraphCodeBERT via two-stage supervised contrastive learning plus preprocessing and MLP classification.
- Using Deep Learning Models Pretrained by Self-Supervised Learning for Protein Localization
  DINO-based ViT models pretrained on HPA FOV achieve macro F1 of 0.822 zero-shot and 0.860 after fine-tuning for protein localization on OpenCell, demonstrating effective transfer from SSL pretraining.
- LLMs Struggle with Abstract Meaning Comprehension More Than Expected
  LLMs struggle with abstract meaning comprehension on SemEval-2021 Task 4 more than fine-tuned models, and a new bidirectional attention classifier yields small accuracy gains of 3-4%.