An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al · 2021

4 Pith papers cite this work. Polarity classification is still indexing.

4 Pith papers citing it

browse 4 citing papers

representative citing papers

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning

cs.CV · 2025-07-18 · conditional · novelty 6.0

Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.

MetaMorph: Multimodal Understanding and Generation via Instruction Tuning

cs.CV · 2024-12-18 · unverdicted · novelty 6.0

VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.

InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation

cs.CV · 2023-07-13 · unverdicted · novelty 6.0

InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.

PartCo: Part-Level Correspondence Priors Enhance Category Discovery

cs.CV · 2025-09-26 · unverdicted · novelty 5.0

PartCo improves generalized category discovery by incorporating part-level correspondence priors that capture finer semantic structures and integrate with existing GCD methods.

citing papers explorer

Showing 4 of 4 citing papers.

Franca: Nested Matryoshka Clustering for Scalable Visual Representation Learning cs.CV · 2025-07-18 · conditional · none · ref 52
Franca introduces nested Matryoshka clustering and positional disentanglement in a transparent SSL pipeline to deliver open-source vision models competitive with closed proprietary systems.
MetaMorph: Multimodal Understanding and Generation via Instruction Tuning cs.CV · 2024-12-18 · unverdicted · none · ref 16
VPiT enables pretrained LLMs to perform both visual understanding and generation by predicting discrete text tokens and continuous visual tokens, with understanding data proving more effective than generation-specific data.
InternVid: A Large-scale Video-Text Dataset for Multimodal Understanding and Generation cs.CV · 2023-07-13 · unverdicted · none · ref 68
InternVid supplies 7M videos and LLM captions to train ViCLIP, which reaches leading zero-shot action recognition and competitive retrieval performance.
PartCo: Part-Level Correspondence Priors Enhance Category Discovery cs.CV · 2025-09-26 · unverdicted · none · ref 14
PartCo improves generalized category discovery by incorporating part-level correspondence priors that capture finer semantic structures and integrate with existing GCD methods.

An image is worth 16x16 words: Transformers for image recognition at scale

fields

years

verdicts

representative citing papers

citing papers explorer