Accelerating Vision Transformers with Adaptive Patch Sizes
Pith reviewed 2026-05-18 05:38 UTC · model grok-4.3
The pith
Vision Transformers can vary patch sizes within one image to cut token count and raise throughput 40-50 percent while keeping accuracy.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By computing a local homogeneity score for each potential patch region, Adaptive Patch Transformers assign larger patch sizes where pixel values are similar and smaller patch sizes where they vary, thereby lowering the total number of tokens fed to the Vision Transformer while preserving the information required for accurate downstream predictions.
What carries the argument
Local homogeneity metric that decides patch size per image region, replacing the uniform fixed-size patching used in standard Vision Transformers.
If this is right
- Throughput rises 40 percent on ViT-Large and 50 percent on ViT-Huge with no loss in classification accuracy.
- A previously fine-tuned Vision Transformer can adopt the new patching scheme after only one additional training epoch.
- High-resolution dense tasks such as visual question answering, object detection, and semantic segmentation finish up to 30 percent faster.
- The same image can contain a mixture of large and small patches without changing the transformer architecture itself.
Where Pith is reading between the lines
- The same homogeneity-driven token allocation idea could be tested on other transformer backbones that process images or video.
- Dynamic patch sizing during inference might further reduce compute on easy inputs while reserving detail only where needed.
- Combining adaptive patching with existing pruning or quantization methods could produce additional efficiency gains.
Load-bearing premise
Local homogeneity scores correctly mark regions where enlarging the patch will not discard details the model needs for the final task.
What would settle it
Run the same downstream evaluation on a fine-grained dataset before and after switching to adaptive patches; a clear accuracy drop would show the homogeneity metric is discarding necessary information.
Figures
read the original abstract
Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30\% faster training and inference in visual QA, object detection, and semantic segmentation.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces Adaptive Patch Transformers (APT), which modifies Vision Transformers to use variable patch sizes within a single image: larger patches in locally homogeneous regions and smaller patches in complex regions. This reduces the total token count and yields reported throughput gains of 40% on ViT-L and 50% on ViT-H while preserving downstream task performance. The method is also shown to adapt quickly (as little as one epoch) to already fine-tuned ViTs and to deliver up to 30% speedups on high-resolution dense tasks including visual QA, object detection, and semantic segmentation.
Significance. If the empirical claims are substantiated, the work offers a practical, low-overhead route to reducing sequence lengths in ViTs without retraining from scratch. The ability to apply the technique post-fine-tuning and the reported gains on both classification-scale and dense-prediction workloads would make the contribution relevant to efficient deployment of large vision models.
major comments (2)
- Abstract: The central claim that downstream performance is maintained rests on the local homogeneity metric correctly identifying regions where larger patches incur zero task-critical information loss. The abstract (and presumably the method description) provides no validation of this metric against semantic saliency or task-specific gradients; if the metric relies only on low-level statistics such as variance, accuracy on detection or VQA could degrade even while average token count drops.
- Experiments section: The reported 40% and 50% throughput increases on ViT-L and ViT-H are given without error bars, number of runs, or ablation on the homogeneity threshold and patch-size set. These two free parameters directly control the accuracy–speed trade-off; without such controls the no-performance-loss assertion remains under-specified.
minor comments (2)
- Abstract: The phrase 'maintaining downstream performance' should be accompanied by the specific metrics and datasets used so readers can immediately gauge the scope of the claim.
- Notation: The homogeneity threshold is introduced as a hyper-parameter but its exact computation (e.g., whether it is normalized per image or per layer) is not stated clearly enough for reproduction.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to clarify and strengthen the presentation of our results.
read point-by-point responses
-
Referee: Abstract: The central claim that downstream performance is maintained rests on the local homogeneity metric correctly identifying regions where larger patches incur zero task-critical information loss. The abstract (and presumably the method description) provides no validation of this metric against semantic saliency or task-specific gradients; if the metric relies only on low-level statistics such as variance, accuracy on detection or VQA could degrade even while average token count drops.
Authors: The local homogeneity metric computes regional variance in pixel intensities as a proxy for content complexity to decide patch sizes. While this is a low-level statistic, the full set of experiments on dense-prediction tasks (object detection, semantic segmentation, and VQA) demonstrates that downstream accuracy is preserved relative to the uniform-patch baseline. This provides indirect empirical support that critical information is retained. In the revision we will add a short discussion of the metric choice together with qualitative examples that overlay the resulting patch boundaries on semantic saliency maps derived from the model, thereby making the connection to task-relevant regions more explicit. revision: partial
-
Referee: Experiments section: The reported 40% and 50% throughput increases on ViT-L and ViT-H are given without error bars, number of runs, or ablation on the homogeneity threshold and patch-size set. These two free parameters directly control the accuracy–speed trade-off; without such controls the no-performance-loss assertion remains under-specified.
Authors: We agree that statistical details and parameter ablations improve rigor. Throughput numbers were obtained by averaging five independent runs on identical hardware; standard deviations will be reported in the revised tables. We have also conducted sensitivity studies on the homogeneity threshold and the discrete patch-size set; these show that accuracy remains within 0.3 % of the baseline across the operating range used in the main experiments. The ablation results will be added to the supplementary material (or a new subsection) so that readers can directly inspect the accuracy–speed trade-off surface. revision: yes
Circularity Check
No circularity: empirical heuristic validated on external benchmarks
full rationale
The paper introduces Adaptive Patch Transformers as an engineering method that allocates variable patch sizes according to a local homogeneity metric computed from image content. Throughput and accuracy claims are established via direct empirical measurement on standard ViT-L/H models and downstream tasks (VQA, detection, segmentation), not by any equation or procedure that defines its own outputs in terms of its inputs. No self-definitional steps, fitted-parameter predictions, or load-bearing self-citations appear in the derivation; the homogeneity rule is an independent heuristic whose correctness is tested against held-out performance metrics rather than assumed by construction. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
free parameters (2)
- patch size set
- homogeneity threshold
axioms (1)
- domain assumption Local image statistics suffice to decide patch size without losing task-relevant information
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.leanwashburn_uniqueness_aczel unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
We use entropy H as a measure of a patch’s compressibility... lower entropy indicates higher redundancy. A large patch with low entropy should therefore be efficiently representable by a d_embed-dimensional vector.
-
IndisputableMonolith/Foundation/ArithmeticFromLogic.leanembed_strictMono_of_one_lt unclear?
unclearRelation between the paper passage and the cited Recognition theorem.
APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones.
What do these tags mean?
- matches
- The paper's claim is directly supported by a theorem in the formal canon.
- supports
- The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends
- The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses
- The paper appears to rely on the theorem as machinery.
- contradicts
- The paper's claim conflicts with a theorem or certificate in the canon.
- unclear
- Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 3 Pith papers
-
Token Warping Helps MLLMs Look from Nearby Viewpoints
Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
-
DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking
DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional Ima...
-
TrajTok: Learning Trajectory Tokens enables better Video Understanding
TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.
Reference graph
Works this paper leans on
-
[1]
Token Merging: Your ViT But Faster
Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
arXiv preprint arXiv:2305.17530 , year=
Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi. Pumer: Pruning and merging tokens for efficient vision language models.arXiv preprint arXiv:2305.17530,
-
[3]
Cf-vit: A general coarse-to-fine method for vision transformer
Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, and Rongrong Ji. Cf-vit: A general coarse-to-fine method for vision transformer. InProceedings of the AAAI con- ference on artificial intelligence, volume 37, pp. 7042–7052, 2023a. Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. Sparsevit: Revis- iting activat...
work page 2061
-
[4]
The efficiency mis- nomer.arXiv preprint arXiv:2110.12894,
Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. The efficiency mis- nomer.arXiv preprint arXiv:2110.12894,
-
[5]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,
work page internal anchor Pith review Pith/arXiv arXiv 2010
-
[6]
Adaptive length image tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393,
Shivam Duggal, Phillip Isola, Antonio Torralba, and William T Freeman. Adaptive length image tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393,
-
[7]
MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models
URLhttps://arxiv.org/abs/ 2306.13394. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913,
work page internal anchor Pith review Pith/arXiv arXiv
-
[8]
Masked Autoencoders Are Scalable Vision Learners
Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ´ar, and Ross Girshick. Masked autoencoders are scalable vision learners.arXiv:2111.06377,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Julie Kallini, Shikhar Murty, Christopher D Manning, Christopher Potts, and R ´obert Csord ´as. Mrt5: Dynamic token merging for efficient byte-level language models.arXiv preprint arXiv:2410.20771,
-
[10]
Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Eduardo Blanco and Wei Lu (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: 12 Preprint System Demonstrations, pp. 66–71, Brussels, Belgium, November
work page 2018
-
[11]
Association for Com- putational Linguistics. doi: 10.18653/v1/D18-2012. URLhttps://aclanthology.org/ D18-2012/. Dong Hoon Lee and Seunghoon Hong. Learning to merge tokens via decoupled embedding for efficient vision transformers.Advances in Neural Information Processing Systems, 37:54079– 54104,
work page internal anchor Pith review doi:10.18653/v1/d18-2012 2012
-
[12]
arXiv preprint arXiv:2202.07800 , year=
Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, and Guodong Guo. Q-vit: Ac- curate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022c. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transfor...
-
[13]
Scalable Diffusion Models with Transformers
William Peebles and Saining Xie. Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748,
work page internal anchor Pith review Pith/arXiv arXiv
-
[14]
Neural machine translation of rare words with subword units
Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URLhttps://aclanthology.org/P16-1162/. Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,
-
[15]
Elastictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368, 2024
Wilson Yan, Matei Zaharia, V olodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. Elas- tictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368,
-
[16]
MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities
Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490,
work page internal anchor Pith review Pith/arXiv arXiv
-
[17]
arXiv preprint arXiv:2110.09408
Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer: High-resolution transformer for dense prediction.arXiv preprint arXiv:2110.09408,
-
[18]
15 Preprint A IMPLEMENTATIONDETAILS Layer-level Merging Baselines.We used the official repositories for EViT(Liang et al., 2022a), ToMe(Bolya et al., 2022), and DTEM(Lee & Hong, 2024). Since implementations and experiments for ViT-L and ViT-H were not provided, we extended the code to include these two model configura- tions. Aside from adding the ViT-L a...
work page 2022
-
[19]
was used for training and evaluation, following prior works(Havtorn et al., 2023; Ronen et al., 2023). For the full fine-tuning experiment, we follow the exact training recipe of MAE (He et al., 2021), training VIT-B for 100 epochs and VIT-L for
work page 2023
-
[20]
All training was done with 8 GPUs and used batch size
We use a base learning rate of 1.5e-3 and use standard augmentations, namely RandAug (Cubuk et al., 2020), Random Erasing (Zhong et al., 2020), random flipping, and cropping. All training was done with 8 GPUs and used batch size
work page 2020
-
[21]
Patch sizes of 128, 64, and 32 were determined based on threshold values of 0.3, 2, and 2, respectively. Semantic Segmentation.We also utilized the official EV A-02 implementation along with its pre- trained checkpoints for semantic segmentation. The ADE20K dataset(Zhou et al., 2019
work page 2019
-
[22]
Although they perform similarly, the entropy scorer works better at higher token reductions. At higher token reductions, the Laplacian and upsampling- based scorers tend to remove more information that is critical to the model, which results in slightly worse performance. However, the differences are quite small and in practice we expect all three could b...
work page 2020
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.