preprint arXiv:2007.06346
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3
representative citing papers
VICReg prevents collapse in self-supervised image embeddings via explicit variance, invariance, and covariance regularization and matches state-of-the-art downstream performance.
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
citing papers explorer
- Emerging Properties in Self-Supervised Vision Transformers
  Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
- VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning
  VICReg prevents collapse in self-supervised image embeddings via explicit variance, invariance, and covariance regularization and matches state-of-the-art downstream performance.
- Vision Transformers Need Registers
  Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.