pith. machine review for the scientific record.

arxiv: 2510.18091 · v2 · submitted 2025-10-20 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Accelerating Vision Transformers with Adaptive Patch Sizes

Authors on Pith: no claims yet
classification 💻 cs.CV · cs.AI · cs.LG
keywords: patch · inference · input · sizes · training · transformers · adaptive · high-resolution

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
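The abstract's core idea — larger patches for homogeneous regions, smaller patches for complex ones — can be sketched as a quadtree-style allocation over image blocks. The function below is a minimal illustration only, assuming a variance-based homogeneity test and power-of-two patch sizes; the paper's actual allocation rule, thresholds, and patch sizes may differ.

```python
import numpy as np

def adaptive_patches(image, max_patch=32, min_patch=8, var_thresh=100.0):
    """Allocate variable-size patches over a 2D image.

    Homogeneous blocks (low pixel variance) are kept at the largest
    patch size; high-variance blocks are recursively split until they
    are homogeneous or reach min_patch. Returns (y, x, size) triples.
    Hypothetical sketch of the APT idea, not the paper's implementation.
    """
    H, W = image.shape[:2]
    patches = []

    def split(y, x, size):
        block = image[y:y + size, x:x + size]
        if size <= min_patch or block.var() <= var_thresh:
            patches.append((y, x, size))  # homogeneous enough: one token
        else:
            half = size // 2
            for dy in (0, half):          # split into four sub-blocks
                for dx in (0, half):
                    split(y + dy, x + dx, half)

    for y in range(0, H, max_patch):
        for x in range(0, W, max_patch):
            split(y, x, max_patch)
    return patches

# Usage: a 64x64 image with a flat top half and a noisy bottom half.
# The flat half is covered by two 32x32 patches, while the noisy half
# is split down to 8x8 patches, so far fewer tokens than uniform 8x8.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[32:] = rng.uniform(0, 255, size=(32, 64))
patches = adaptive_patches(img)
print(len(patches), "adaptive tokens vs", (64 // 8) ** 2, "uniform tokens")
```

Token-count savings like this are what translate into the throughput gains the abstract reports, since transformer attention cost grows with sequence length.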

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  2. DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking

    cs.CV 2026-03 unverdicted novelty 7.0

    DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional Ima...

  3. TrajTok: Learning Trajectory Tokens enables better Video Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.