arXiv preprint arXiv:1904.12848 (Apr 2019)
12 Pith papers cite this work.
representative citing papers
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
GraphStar is a new GNN that adds star nodes and relay attention to achieve non-local representations for node, graph, and link tasks, claiming 2-5% gains over prior SOTA on benchmarks.
Invariance-inducing regularization using worst-case transformations reduces relative error by 20% on transformed CIFAR-10 examples, improves standard accuracy on SVHN, and outperforms equivariant networks; the paper also proves there is no accuracy-robustness trade-off in the infinite-data limit.
Graph imputation neural networks augment semi-supervised datasets up to 10x by reconstructing heavily damaged samples on a similarity graph, improving over fully-supervised baselines on benchmarks.
Deep learning models extract content-agnostic voice biomarkers for depression and anxiety from a ~65k-utterance proprietary dataset, achieving 71% sensitivity and specificity when combined with lexical features.
Decomposing automotive query understanding into a lightweight classification stage followed by specialized entity extraction yields better accuracy and lower latency than joint single-step processing.
citing papers explorer
Emerging Properties in Self-Supervised Vision Transformers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.