Unsupervised Representation Learning by Predicting Image Rotations
Over recent years, deep convolutional neural networks (ConvNets) have transformed the field of computer vision thanks to their unparalleled capacity to learn high-level semantic image features. However, in order to successfully learn those features, they usually require massive amounts of manually labeled data, which is both expensive and impractical to scale. Therefore, unsupervised semantic feature learning, i.e., learning without requiring manual annotation effort, is of crucial importance in order to successfully harvest the vast amount of visual data available today. In our work we propose to learn image features by training ConvNets to recognize the 2D rotation applied to the input image. We demonstrate both qualitatively and quantitatively that this apparently simple task actually provides a very powerful supervisory signal for semantic feature learning. We exhaustively evaluate our method on various unsupervised feature learning benchmarks and achieve state-of-the-art performance on all of them. Specifically, our results on those benchmarks demonstrate dramatic improvements w.r.t. prior state-of-the-art approaches in unsupervised representation learning and thus significantly close the gap with supervised feature learning. For instance, on the PASCAL VOC 2007 detection task our unsupervised pre-trained AlexNet model achieves the state-of-the-art (among unsupervised methods) mAP of 54.4%, which is only 2.4 points lower than the supervised case. We get similarly striking results when we transfer our unsupervised learned features to various other tasks, such as ImageNet classification, PASCAL classification, PASCAL segmentation, and CIFAR-10 classification. The code and models of our paper will be published at: https://github.com/gidariss/FeatureLearningRotNet .
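The pretext task described in the abstract can be illustrated with a small sketch. It assumes the rotation set is the four multiples of 90 degrees (0, 90, 180, 270), which the abstract itself does not spell out, and it uses NumPy only; the function name `make_rotation_batch` is hypothetical and is not taken from the authors' repository. Each image yields four rotated copies, and the index of the rotation serves as the classification label a ConvNet would be trained to predict.

```python
import numpy as np

def make_rotation_batch(image):
    """Build rotated copies of an H x W x C image and their rotation labels.

    Assumption: the label k in {0, 1, 2, 3} stands for a rotation of
    k * 90 degrees (np.rot90 rotates counterclockwise).
    """
    # Rotate in the spatial plane (axes 0 and 1); the channel axis is untouched.
    rotated = [np.rot90(image, k=k, axes=(0, 1)) for k in range(4)]
    labels = np.arange(4)
    return rotated, labels

# Toy 2 x 2 RGB "image" so the effect of each rotation is easy to inspect.
image = np.arange(2 * 2 * 3).reshape(2, 2, 3)
rotated, labels = make_rotation_batch(image)
```

No manual annotation is needed: the labels come for free from the transformation applied, which is what makes the signal unsupervised in the sense used above.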
Forward citations
Cited by 19 Pith papers
- REMAP: Regularized Matching and Partial Alignment of Video Embeddings
  REMAP applies regularized fused partial Gromov-Wasserstein optimal transport to align video embeddings for unsupervised procedure learning on noisy instructional videos.
- A Simple Framework for Contrastive Learning of Visual Representations
  SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
- Temporal Aware Pruning for Efficient Diffusion-based Video Generation
  TAPE introduces temporal-aware token pruning for diffusion-based video generation, using frame smoothing, layer reselection, and timestep budgets to achieve speedups while maintaining visual fidelity and coherence.
- Multi-hop Relational Contrastive Learning: Extending Spatial Contrastive Pre-training Beyond Pairwise Relations
  MRCL extends pairwise spatial contrastive pre-training to multi-hop paths in scene graphs, yielding NDCG@5 = 0.748 on GQA graph retrieval and gains on spatial recognition and QA tasks.
- MU-SHOT-Fi: Self-Supervised Multi-User Wi-Fi Sensing with Source-free Unsupervised Domain Adaptation
  MU-SHOT-Fi is a source-free UDA framework for multi-user WiFi HAR using permutation-invariant set prediction, occupancy-weighted information maximization, and binary rotation prediction to handle domain shifts.
- MU-SHOT-Fi: Self-Supervised Multi-User Wi-Fi Sensing with Source-free Unsupervised Domain Adaptation
  MU-SHOT-Fi recovers multi-user activity classification accuracy under domain shifts in WiFi CSI sensing using source-free adaptation with Hungarian matching, occupancy-weighted entropy regularization, and rotation pre...
- Self-supervised pretraining for an iterative image size agnostic vision transformer
  A sequential-to-global SSL method based on DINO pretrains iterative foveal-inspired vision transformers to achieve competitive ImageNet-1K performance with constant compute regardless of input resolution.
- Probing Intrinsic Medical Task Relationships: A Contrastive Learning Perspective
  TaCo contrastively embeds semantic, generative, and transformation tasks from medical imaging into a joint space to reveal which tasks cluster, blend, or remain distinct.
- gen2seg: Generative Models Enable Generalizable Instance Segmentation
  Finetuning generative models on limited instance segmentation data produces zero-shot generalization to unseen object categories and styles, matching or exceeding supervised baselines like SAM on ambiguous boundaries.
- Orthogonal Subspace Decomposition for Generalizable AI-Generated Image Detection
  Orthogonal subspace decomposition via SVD on vision foundation model features preserves high-rank pre-trained knowledge by freezing principal components and adapting residuals, reducing overfitting for better generali...
- Vector-quantized Image Modeling with Improved VQGAN
  Improved ViT-VQGAN enables autoregressive Transformer pretraining on ImageNet tokens to reach IS 175.1 and FID 4.17 for generation plus 73.2% linear-probe accuracy, beating prior iGPT models.
- Multi-task Self-Supervised Learning for Human Activity Detection
  A multi-task self-supervised approach trains a temporal CNN to detect transformations on sensory data, yielding features that match or exceed fully supervised performance in semi-supervised and transfer settings for s...
- A Dialogue between Causal and Traditional Representation Learning: Toward Mutual Benefits in a Unified Formulation
  The paper introduces a unified formulation for representation learning with task and constraint components, arguing for mutual benefits between causal and traditional approaches and showing via experiments that causal...
- Temporal Aware Pruning for Efficient Diffusion-based Video Generation
  TAPE applies temporal-aware token pruning with smoothing, reselection, and timestep scheduling to speed up video diffusion models while preserving visual fidelity and coherence.
- Information theoretic underpinning of self-supervised learning by clustering
  SSL clustering is derived as KL-divergence optimization where a teacher-distribution constraint normalizes via inverse cluster priors and simplifies to batch centering by Jensen's inequality.
- On the Power of Foundation Models
  Category theory proves prompt-based learning on perfect foundation models works only for representable tasks, fine-tuning solves tasks in the pretext category, and models can represent unseen target-category objects u...
- From pre-training to downstream performance: Does domain-specific pre-training make sense?
  Pre-training on modality-matched data significantly improves downstream performance in medical imaging models while self-supervised learning benefits depend on context.
- MAE-SAM2: Mask Autoencoder-Enhanced SAM2 for Clinical Retinal Vascular Leakage Segmentation
  MAE-SAM2 integrates MAE self-supervised learning with SAM2 to achieve superior segmentation of retinal vascular leakage on fluorescein angiography images, with highest Dice/IoU scores and 5% improvement over original SAM2.
- Accurate and Robust Pulmonary Nodule Detection by 3D Feature Pyramid Network with Self-supervised Feature Learning
  A 3DFPN with self-supervised pretraining and HS2 false-positive reduction using location history images reaches 90.6% sensitivity at 0.125 FP/scan on LUNA16, claimed 15.8% above prior results.