Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
Unsupervised visual representation learning by context prediction
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3representative citing papers
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.
citing papers explorer
-
Masked Autoencoders Are Scalable Vision Learners
Masked autoencoders with asymmetric encoder-decoder and 75% masking ratio enable scalable self-supervised pre-training of vision transformers, achieving 87.8% ImageNet-1K accuracy with ViT-Huge using only unlabeled data.
-
Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
-
InternVideo: General Video Foundation Models via Generative and Discriminative Learning
InternVideo combines masked video modeling and video-language contrastive learning into a single foundation model that reaches state-of-the-art results on 39 video datasets including 91.1% top-1 on Kinetics-400.