An image is worth 16x16 words: Transformers for image recognition at scale. ICLR

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby · 2021

2 Pith papers cite this work. Polarity classification is still indexing.

representative citing papers

Weakly-Supervised Referring Video Object Segmentation through Text Supervision

cs.CV · 2026-04-20 · unverdicted · novelty 6.0

WSRVOS enables referring video object segmentation with text-only supervision by combining MLLM-based expression augmentation, multimodal feature interaction, pseudo-mask fusion, and temporal ranking constraints.

SAT: Selective Aggregation Transformer for Image Super-Resolution

cs.CV · 2026-04-09 · unverdicted · novelty 5.0

SAT introduces density and isolation-based token aggregation to enable efficient global attention in super-resolution transformers, claiming up to 0.22 dB PSNR gain and 27% FLOP reduction over PFT.

citing papers explorer

Showing 2 of 2 citing papers.

Weakly-Supervised Referring Video Object Segmentation through Text Supervision · cs.CV · 2026-04-20 · unverdicted · none · ref 12
SAT: Selective Aggregation Transformer for Image Super-Resolution · cs.CV · 2026-04-09 · unverdicted · none · ref 18
