Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
arXiv preprint arXiv:2102.05644 (2021)
3 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary
citation-polarity summary
fields
cs.CV 3roles
background 1polarities
background 1representative citing papers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
Pre-trained ViT representations combined with active learning and targeted design choices for annotations and selection improve object class retrieval in multi-object scenes.
citing papers explorer
-
Emerging Properties in Self-Supervised Vision Transformers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
Revisiting Human-in-the-Loop Object Retrieval with Pre-Trained Vision Transformers
Pre-trained ViT representations combined with active learning and targeted design choices for annotations and selection improve object class retrieval in multi-object scenes.