pith. machine review for the scientific record.

arxiv: 2510.18091 · v2 · submitted 2025-10-20 · 💻 cs.CV · cs.AI · cs.LG

Recognition: unknown

Accelerating Vision Transformers with Adaptive Patch Sizes

Authors on Pith: no claims yet
classification 💻 cs.CV · cs.AI · cs.LG
keywords: patch · inference · input · sizes · training · transformers · adaptive · high-resolution

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30% faster training and inference in visual QA, object detection, and semantic segmentation.
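The abstract's core idea — larger patches for homogeneous regions, smaller patches for complex ones — can be sketched as a quadtree-style allocation over image blocks. The function below is a minimal illustration only, assuming a variance-based homogeneity test and power-of-two patch sizes; the paper's actual allocation rule, thresholds, and patch sizes may differ.

```python
import numpy as np

def adaptive_patches(image, max_patch=32, min_patch=8, var_thresh=100.0):
    """Allocate variable-size patches over a 2D image.

    Homogeneous blocks (low pixel variance) are kept at the largest
    patch size; high-variance blocks are recursively split until they
    are homogeneous or reach min_patch. Returns (y, x, size) triples.
    Hypothetical sketch of the APT idea, not the paper's implementation.
    """
    H, W = image.shape[:2]
    patches = []

    def split(y, x, size):
        block = image[y:y + size, x:x + size]
        if size <= min_patch or block.var() <= var_thresh:
            patches.append((y, x, size))  # homogeneous enough: one token
        else:
            half = size // 2
            for dy in (0, half):          # split into four sub-blocks
                for dx in (0, half):
                    split(y + dy, x + dx, half)

    for y in range(0, H, max_patch):
        for x in range(0, W, max_patch):
            split(y, x, max_patch)
    return patches

# Usage: a 64x64 image with a flat top half and a noisy bottom half.
# The flat half is covered by two 32x32 patches, while the noisy half
# is split down to 8x8 patches, so far fewer tokens than uniform 8x8.
rng = np.random.default_rng(0)
img = np.zeros((64, 64))
img[32:] = rng.uniform(0, 255, size=(32, 64))
patches = adaptive_patches(img)
print(len(patches), "adaptive tokens vs", (64 // 8) ** 2, "uniform tokens")
```

Token-count savings like this are what translate into the throughput gains the abstract reports, since transformer attention cost grows with sequence length.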

This paper has not been read by Pith yet.

discussion (0)


Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Token Warping Helps MLLMs Look from Nearby Viewpoints

    cs.CV 2026-04 unverdicted novelty 7.0

    Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.

  2. DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking

    cs.CV 2026-03 unverdicted novelty 7.0

    DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional Ima...

  3. TrajTok: Learning Trajectory Tokens enables better Video Understanding

    cs.CV 2026-02 unverdicted novelty 7.0

    TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.