arXiv preprint arXiv:1903.03825
2 Pith papers cite this work.
citing papers explorer
- Revisiting Feature Prediction for Learning Visual Representations from Video
  V-JEPA models trained only on feature prediction from 2 million public videos achieve 81.9% on Kinetics-400, 72.2% on Something-Something-v2, and 77.9% on ImageNet-1K using frozen ViT-H/16 backbones.
- HODGEPODGE: Sound event detection based on ensemble of semi-supervised learning methods
  An ensemble of CRNNs trained with consistency regularization and MixUp on mixed labeled/unlabeled data reaches 42.0% event-based F-measure on DCASE 2019 Task 4, beating the 25.8% baseline.