preprint arXiv:2010.10504
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
representative citing papers
Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.
ViP-VL claims state-of-the-art results on Vietnamese ASR, emotion recognition, dialect classification, and speaker verification via vector-quantization self-supervised pretraining on 17k hours with 8x subsampling modifications.
Evaluation of open-source and commercial ASR models on narrow-band Hindi and Indian English shows poor zero-shot results and inconsistent fine-tuning benefits tied to pretraining exposure.
citing papers explorer
- Vision Transformers Need Registers
Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.