Adding register tokens to Vision Transformers eliminates high-norm background artifacts and raises state-of-the-art performance on dense visual prediction tasks.
preprint arXiv:2010.10504
4 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 4
representative citing papers
Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.
ViP-VL achieves claimed state-of-the-art results on Vietnamese ASR, emotion recognition, dialect classification, and speaker verification via vector-quantization self-supervised pretraining on 17k hours with 8x subsampling modifications.
Evaluation of open-source and commercial ASR models on narrow-band Hindi and Indian English shows poor zero-shot results and inconsistent fine-tuning benefits tied to pretraining exposure.
citing papers explorer
- Adopting State-of-the-Art Pretrained Audio Representations for Music Recommender Systems
  Pretrained audio models show large performance gaps between standard MIR tasks and music recommendation in both hot and cold-start settings.
- ViP-VL: Vietnamese Self-supervised Speech Pretraining Model with Vector-Quantization Learning
  ViP-VL achieves claimed state-of-the-art results on Vietnamese ASR, emotion recognition, dialect classification, and speaker verification via vector-quantization self-supervised pretraining on 17k hours with 8x subsampling modifications.
- Responsible ASR: Overcoming Challenges of Foundational Models in Narrow-Band and Low-Resource Settings
  Evaluation of open-source and commercial ASR models on narrow-band Hindi and Indian English shows poor zero-shot results and inconsistent fine-tuning benefits tied to pretraining exposure.