An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby · 2021

1 Pith paper cites this work. Polarity classification is still being indexed.

representative citing papers

Hierarchically Robust Zero-shot Vision-language Models

cs.CV · 2026-04-20 · unverdicted · novelty 7.0

A hierarchical adversarial fine-tuning method for VLMs aligns image and text embeddings at multiple hierarchy depths, with theoretical connections to margins, to boost robustness to leaf and superclass attacks; it uses multiple trees for semantic variety.

citing papers explorer

Showing 1 of 1 citing paper.

Hierarchically Robust Zero-shot Vision-language Models

cs.CV · 2026-04-20 · unverdicted · none · ref 11
