CoAtNet: Marrying convolution and attention for all data sizes

Zihang Dai, Hanxiao Liu, Quoc V Le, Mingxing Tan · 2021 · arXiv 2106.04803

3 Pith papers cite this work. Polarity classification is still indexing.


representative citing papers

Florence: A New Foundation Model for Computer Vision

cs.CV · 2021-11-22 · unverdicted · novelty 6.0

Florence is a new vision foundation model that learns universal visual-language representations from web-scale data and reports state-of-the-art results on 44 benchmarks including 83.74% zero-shot ImageNet top-1 accuracy.

MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

cs.CV · 2021-10-05 · unverdicted · novelty 6.0

MobileViT is a lightweight vision transformer that reports 78.4% top-1 accuracy on ImageNet-1k with ~6M parameters, outperforming MobileNetV3 by 3.2% and DeiT by 6.2% at similar size, plus gains on MS-COCO detection.

Advancing Vision Transformer with Enhanced Spatial Priors

cs.CV · 2026-04-20 · unverdicted · novelty 4.0

EVT improves Vision Transformers by using Euclidean distance decay for spatial priors and simpler grouping, achieving 86.6% top-1 accuracy on ImageNet-1k.

citing papers explorer

Showing 3 of 3 citing papers.

Florence: A New Foundation Model for Computer Vision · cs.CV · 2021-11-22 · unverdicted · none · ref 5
MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer · cs.CV · 2021-10-05 · unverdicted · none · ref 4
Advancing Vision Transformer with Enhanced Spatial Priors · cs.CV · 2026-04-20 · unverdicted · none · ref 44
