An image is worth 16x16 words: Transformers for image recognition at scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby · 2021

6 Pith papers cite this work. Polarity classification is still indexing.

6 Pith papers citing it

browse 6 citing papers

citation-role summary

background 2

citation-polarity summary

background 2

representative citing papers

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers

cs.CV · 2026-05-08 · unverdicted · novelty 7.0

PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.

Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling

cs.CV · 2026-05-18 · unverdicted · novelty 6.0

Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.

ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference

cs.DC · 2026-05-11 · unverdicted · novelty 6.0

ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.

PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection

cs.CV · 2025-09-30 · unverdicted · novelty 6.0

PRPO is a paragraph-level policy optimization technique that grounds vision-language model reasoning in image content to raise deepfake detection accuracy and reasoning quality.

Bernini: Latent Semantic Planning for Video Diffusion

cs.CV · 2026-05-21 · unverdicted · novelty 5.0

Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.

ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines

cs.CV · 2026-04-13 · unverdicted · novelty 5.0

ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.

citing papers explorer

Showing 6 of 6 citing papers.

PaceVGGT: Pre-Alternating-Attention Token Pruning for Visual Geometry Transformers cs.CV · 2026-05-08 · unverdicted · none · ref 6
PaceVGGT reduces VGGT inference latency by up to 5.1x on ScanNet-50 via pre-AA token pruning with a distilled Token Scorer, per-frame keep budgets, adaptive merge/prune, and feature-guided restoration, while preserving reconstruction quality on ScanNet-50 and 7-Scenes.
Resolving Representation Ambiguity in Feedforward Novel View Synthesis Transformer via Semantic-Spatial Decoupling cs.CV · 2026-05-18 · unverdicted · none · ref 9
Decouples semantic and spatial tokens in NVS transformers to resolve representation ambiguity, yielding consistent gains with near-zero added latency.
ChunkFlow: Communication-Aware Chunked Prefetching for Layerwise Offloading in Distributed Diffusion Transformer Inference cs.DC · 2026-05-11 · unverdicted · none · ref 7
ChunkFlow achieves up to 1.28x step-time speedup and up to 49% lower peak GPU memory for DiT inference by using a first-order model to guide communication-aware chunked prefetching.
PRPO: Paragraph-level Policy Optimization for Vision-Language Deepfake Detection cs.CV · 2025-09-30 · unverdicted · none · ref 12
PRPO is a paragraph-level policy optimization technique that grounds vision-language model reasoning in image content to raise deepfake detection accuracy and reasoning quality.
Bernini: Latent Semantic Planning for Video Diffusion cs.CV · 2026-05-21 · unverdicted · none · ref 19
Bernini is a framework that uses an MLLM planner to output semantic representations for a DiT renderer to generate or edit videos, reporting SOTA benchmark performance.
ConvFormer3D-TAP: Phase/Uncertainty-Aware Front-End Fusion for Cine CMR View Classification Pipelines cs.CV · 2026-04-13 · unverdicted · none · ref 24
ConvFormer3D-TAP classifies six cine CMR views at 96% accuracy using 3D conv tokenization, multiscale attention, and uncertainty-aware multi-clip fusion on 150k sequences.

An image is worth 16x16 words: Transformers for image recognition at scale

citation-role summary

citation-polarity summary

fields

years

verdicts

roles

polarities

representative citing papers

citing papers explorer