Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
hub
preprint arXiv:1904.12848 , year=
12 Pith papers cite this work. Polarity classification is still indexing.
hub tools
representative citing papers
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
GraphStar is a new GNN that adds star nodes and relay attention to achieve non-local representations for node, graph, and link tasks, claiming 2-5% gains over prior SOTA on benchmarks.
Invariance-inducing regularization using worst-case transformations reduces relative error by 20% on CIFAR10 transformed examples, improves standard accuracy on SVHN, outperforms equivariant networks, and proves no accuracy-robustness trade-off in the infinite data limit.
Graph imputation neural networks augment semi-supervised datasets up to 10x by reconstructing heavily damaged samples on a similarity graph, improving over fully-supervised baselines on benchmarks.
Deep learning models extract content-agnostic voice biomarkers for depression and anxiety from a ~65k-utterance proprietary dataset, achieving 71% sensitivity and specificity when combined with lexical features.
Decomposing automotive query understanding into a lightweight classification stage followed by specialized entity extraction yields better accuracy and lower latency than joint single-step processing.
citing papers explorer
-
Emerging Properties in Self-Supervised Vision Transformers
Self-supervised ViTs show emergent semantic segmentation and 78.3% k-NN accuracy on ImageNet; DINO reaches 80.1% linear evaluation with ViT-Base.
-
Longformer: The Long-Document Transformer
Longformer uses local windowed attention plus task-specific global attention to achieve linear scaling and state-of-the-art results on long-document language modeling, QA, and summarization after pretraining.
-
A Simple Framework for Contrastive Learning of Visual Representations
SimCLR learns visual representations by contrasting augmented views of the same image and reaches 76.5% ImageNet top-1 accuracy with a linear classifier, matching a supervised ResNet-50.
-
Unsupervised Cross-lingual Representation Learning at Scale
XLM-R, pretrained on 100 languages with 2TB of CommonCrawl data, improves average XNLI accuracy by 14.6 points and MLQA F1 by 13 points over mBERT while matching strong monolingual models on GLUE.
-
XLNet: Generalized Autoregressive Pretraining for Language Understanding
XLNet is a generalized autoregressive pretraining method that learns bidirectional contexts via permutation-based factorization and outperforms BERT on 20 NLP tasks.
-
Revisiting Feature Prediction for Learning Visual Representations from Video
V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
-
Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
-
Graph Star Net for Generalized Multi-Task Learning
GraphStar is a new GNN that adds star nodes and relay attention to achieve non-local representations for node, graph, and link tasks, claiming 2-5% gains over prior SOTA on benchmarks.
-
Invariance-inducing regularization using worst-case transformations suffices to boost accuracy and spatial robustness
Invariance-inducing regularization using worst-case transformations reduces relative error by 20% on CIFAR10 transformed examples, improves standard accuracy on SVHN, outperforms equivariant networks, and proves no accuracy-robustness trade-off in the infinite data limit.
-
Efficient data augmentation using graph imputation neural networks
Graph imputation neural networks augment semi-supervised datasets up to 10x by reconstructing heavily damaged samples on a similarity graph, improving over fully-supervised baselines on benchmarks.
-
Voice Biomarkers for Depression and Anxiety
Deep learning models extract content-agnostic voice biomarkers for depression and anxiety from a ~65k-utterance proprietary dataset, achieving 71% sensitivity and specificity when combined with lexical features.
-
Domain-Specific Query Understanding for Automotive Applications: A Modular and Scalable Approach
Decomposing automotive query understanding into a lightweight classification stage followed by specialized entity extraction yields better accuracy and lower latency than joint single-step processing.