arXiv preprint arXiv:2112.10740
5 Pith papers cite this work.
citing papers explorer
- Towards Understanding Self-Pretraining for Sequence Classification
Self-pretraining improves Transformer sequence classification by letting the model learn proximity-biased attention from positional encodings, which label supervision alone cannot easily acquire from a random initialization.
- Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
- Self-Supervised Learning for Real-World Object Detection: a Survey
Survey benchmarks SSL instance discrimination and masked image modeling for object detection, finding instance discrimination suits CNN encoders while MIM suits ViT encoders and custom pre-training, especially for small objects.
- DINOv2: Learning Robust Visual Features without Supervision