Accelerating Vision Transformers with Adaptive Patch Sizes

Eunho Yang; Jinhyung Park; JungEun Kim; Kris M. Kitani; L\'aszl\'o A. Jeni; Rohan Choudhury

arxiv: 2510.18091 · v2 · submitted 2025-10-20 · 💻 cs.CV · cs.AI· cs.LG

Accelerating Vision Transformers with Adaptive Patch Sizes

Rohan Choudhury , JungEun Kim , Jinhyung Park , Eunho Yang , L\'aszl\'o A. Jeni , Kris M. Kitani This is my paper

Pith reviewed 2026-05-18 05:38 UTC · model grok-4.3

classification 💻 cs.CV cs.AIcs.LG

keywords Vision TransformersAdaptive PatchesToken ReductionEfficient InferenceImage ClassificationObject DetectionSemantic Segmentation

0 comments

The pith

Vision Transformers can vary patch sizes within one image to cut token count and raise throughput 40-50 percent while keeping accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Vision Transformers split every image into the same small patches, which creates very long sequences and slows both training and inference on high-resolution inputs. Adaptive Patch Transformers instead measure local homogeneity and assign larger patches to uniform regions and smaller patches to detailed regions inside the same image. This shortens the overall sequence the transformer processes without removing information the model needs for its task. The change delivers 40 percent higher throughput on ViT-L and 50 percent on ViT-H, works on already fine-tuned models after one extra epoch, and speeds up dense tasks such as object detection and semantic segmentation by up to 30 percent.

Core claim

By computing a local homogeneity score for each potential patch region, Adaptive Patch Transformers assign larger patch sizes where pixel values are similar and smaller patch sizes where they vary, thereby lowering the total number of tokens fed to the Vision Transformer while preserving the information required for accurate downstream predictions.

What carries the argument

Local homogeneity metric that decides patch size per image region, replacing the uniform fixed-size patching used in standard Vision Transformers.

If this is right

Throughput rises 40 percent on ViT-Large and 50 percent on ViT-Huge with no loss in classification accuracy.
A previously fine-tuned Vision Transformer can adopt the new patching scheme after only one additional training epoch.
High-resolution dense tasks such as visual question answering, object detection, and semantic segmentation finish up to 30 percent faster.
The same image can contain a mixture of large and small patches without changing the transformer architecture itself.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same homogeneity-driven token allocation idea could be tested on other transformer backbones that process images or video.
Dynamic patch sizing during inference might further reduce compute on easy inputs while reserving detail only where needed.
Combining adaptive patching with existing pruning or quantization methods could produce additional efficiency gains.

Load-bearing premise

Local homogeneity scores correctly mark regions where enlarging the patch will not discard details the model needs for the final task.

What would settle it

Run the same downstream evaluation on a fine-grained dataset before and after switching to adaptive patches; a clear accuracy drop would show the homogeneity metric is discarding necessary information.

Figures

Figures reproduced from arXiv: 2510.18091 by Eunho Yang, Jinhyung Park, JungEun Kim, Kris M. Kitani, L\'aszl\'o A. Jeni, Rohan Choudhury.

**Figure 1.** Figure 1: Adaptive Patch Sizing. We present APT, Adaptive Patch Transformers, which significantly accelerate vision transformer training and inference by patchifying images based on their content. Complex regions receive more, smaller tokens, while simpler, homogeneous regions receive fewer. their patch embeddings with the information from the original large patch using a zero-initialized MLP, allowing APT to conv… view at source ↗

**Figure 2.** Figure 2: APT overview. APT works by measuring the entropy at multiple scales and assigning large patch sizes to low entropy patches. All patches are projected to the same size token embedding, and the reduced size input sequence is passed to the transformer. formative tokens. While these works are content-aware, most require learning which tokens are unhelpful, negating any training speedup and preventing inference… view at source ↗

**Figure 3.** Figure 3: Embedding Different Patch Sizes. The smallest size patches are projected with the patch embedding. Larger patches are both split into their sub-patches and resized; the sub-patches are embedded, aggregated with a convolution layer. These are combined with the resized embedding with a zero-initialized MLP (Zhang et al., 2023). 3.1 DECIDING PATCH SIZES Consider a vision transformer that takes an H ×W ×C imag… view at source ↗

**Figure 4.** Figure 4: Accuracy vs. Throughput under different compute budgets. Comparison between APT and layer-level merging methods on ViT-L and ViT-H. For a fairer evaluation, we also include their re-implemented Advanced (Adv) versions with FlashAttention, shown with a dashed line. APT consistently outperforms the baselines in both throughput and accuracy across all compute budgets. baseline while using the exact same train… view at source ↗

**Figure 5.** Figure 5: Visualized Examples. APT consistently places large patches on more homogenous regions and smaller patches on more complex ones. We use conservative thresholds to limit information loss. Images are best viewed zoomed in. More visualizations are in Appendix. Res/Patch Base (Img/s) APTτ=−1 ViT-B 224/16 3310 3090 ViT-B 384/16 1151 1030 ViT-L 224/16 883 811 ViT-L 336/14 395 360 ViT-H 224/14 441 418 ViT-H 336/… view at source ↗

**Figure 7.** Figure 7: Analyzing Scorers. We compare the accuracy on ViT-L/336 for different scorers, controlling for the fraction of retained tokens. We find that the the entropy scorer performs best at high reductions, but that all three are relatively similar. Threshold Analysis. The main tunable parameter in APT is the entropy threshold, which can differ per scale and controls how compressible a region must be in order to… view at source ↗

**Figure 8.** Figure 8: Threshold visualization. We can see that patches containing high-frequency details or salient object features are consistently preserved under various thresholds. We used τ = 5.5 for most of the experiments. Zoom in for the best view. compares the average mean squared difference for each patch. Since we resize the input patches to the base size, one might expect that patches that lose minimal information f… view at source ↗

**Figure 9.** Figure 9: Augmentation visualization. We observe that augmentations generally lead to fewer tokens. In particular, Random Erasing (Zhong et al., 2020), leads to regions that can be tokenized with the large patch sizes, significantly increasing throughput compared to inference time [PITH_FULL_IMAGE:figures/full_fig_p019_9.png] view at source ↗

**Figure 10.** Figure 10: Scorer visualization. The entropy, Laplacian and upsampling scorers follow generally the same patterns with minor variations. The entropy scorer uses larger patches on regions with very few differing colors, while the upsampling and Laplacian scorers consistently use small patches on high-texture regions. 19 [PITH_FULL_IMAGE:figures/full_fig_p019_10.png] view at source ↗

read the original abstract

Vision Transformers (ViTs) partition input images into uniformly sized patches regardless of their content, resulting in long input sequence lengths for high-resolution images. We present Adaptive Patch Transformers (APT), which addresses this by using multiple different patch sizes within the same image. APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones. APT achieves a drastic speedup in ViT inference and training, increasing throughput by 40% on ViT-L and 50% on ViT-H while maintaining downstream performance, and can be applied to a previously fine-tuned ViT, converging in as little as 1 epoch. It also significantly reduces training and inference time without loss of performance in high-resolution dense visual tasks, achieving up to 30\% faster training and inference in visual QA, object detection, and semantic segmentation.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

APT gives ViTs a practical token-saving trick with adaptive patches but leaves the homogeneity rule thin on details and validation.

read the letter

The main things to know are that this paper introduces Adaptive Patch Transformers to use larger patches in homogeneous image regions and smaller ones in complex areas, cutting token count enough to claim 40% throughput gains on ViT-L and 50% on ViT-H with no downstream performance loss, and that the same trick can be applied to an already fine-tuned model with convergence in as little as one epoch while also speeding up high-resolution dense tasks like detection and segmentation by up to 30% in training and inference time. What is actually new is the content-adaptive choice of multiple patch sizes inside a single image and forward pass, which is not described in the standard ViT literature the abstract references. The paper does well in keeping the rest of the ViT architecture unchanged and in showing the method works as a lightweight adaptation on existing models rather than requiring full retraining from scratch. It also reports usable speed numbers across both classification-style and dense prediction settings. The soft spots are the missing specifics on how homogeneity is measured, what exact patch size set and thresholds are used, and any ablations on the adaptive rule itself. The abstract gives no error bars and no direct checks that the metric preserves task-critical information rather than just low-level image statistics. The stress-test concern lands here: if the homogeneity decision relies on generic variance or gradient cues, it could misclassify regions that look uniform but carry semantic detail needed for VQA or segmentation, and the provided text offers no evidence that this was tested against task saliency. This paper is for people working on efficient ViT deployments for high-resolution inputs who want a drop-in efficiency tweak. It deserves a serious referee to check the full experimental details and the robustness of the metric.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces Adaptive Patch Transformers (APT), which modifies Vision Transformers to use variable patch sizes within a single image: larger patches in locally homogeneous regions and smaller patches in complex regions. This reduces the total token count and yields reported throughput gains of 40% on ViT-L and 50% on ViT-H while preserving downstream task performance. The method is also shown to adapt quickly (as little as one epoch) to already fine-tuned ViTs and to deliver up to 30% speedups on high-resolution dense tasks including visual QA, object detection, and semantic segmentation.

Significance. If the empirical claims are substantiated, the work offers a practical, low-overhead route to reducing sequence lengths in ViTs without retraining from scratch. The ability to apply the technique post-fine-tuning and the reported gains on both classification-scale and dense-prediction workloads would make the contribution relevant to efficient deployment of large vision models.

major comments (2)

Abstract: The central claim that downstream performance is maintained rests on the local homogeneity metric correctly identifying regions where larger patches incur zero task-critical information loss. The abstract (and presumably the method description) provides no validation of this metric against semantic saliency or task-specific gradients; if the metric relies only on low-level statistics such as variance, accuracy on detection or VQA could degrade even while average token count drops.
Experiments section: The reported 40% and 50% throughput increases on ViT-L and ViT-H are given without error bars, number of runs, or ablation on the homogeneity threshold and patch-size set. These two free parameters directly control the accuracy–speed trade-off; without such controls the no-performance-loss assertion remains under-specified.

minor comments (2)

Abstract: The phrase 'maintaining downstream performance' should be accompanied by the specific metrics and datasets used so readers can immediately gauge the scope of the claim.
Notation: The homogeneity threshold is introduced as a hyper-parameter but its exact computation (e.g., whether it is normalized per image or per layer) is not stated clearly enough for reproduction.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and indicate where revisions will be made to clarify and strengthen the presentation of our results.

read point-by-point responses

Referee: Abstract: The central claim that downstream performance is maintained rests on the local homogeneity metric correctly identifying regions where larger patches incur zero task-critical information loss. The abstract (and presumably the method description) provides no validation of this metric against semantic saliency or task-specific gradients; if the metric relies only on low-level statistics such as variance, accuracy on detection or VQA could degrade even while average token count drops.

Authors: The local homogeneity metric computes regional variance in pixel intensities as a proxy for content complexity to decide patch sizes. While this is a low-level statistic, the full set of experiments on dense-prediction tasks (object detection, semantic segmentation, and VQA) demonstrates that downstream accuracy is preserved relative to the uniform-patch baseline. This provides indirect empirical support that critical information is retained. In the revision we will add a short discussion of the metric choice together with qualitative examples that overlay the resulting patch boundaries on semantic saliency maps derived from the model, thereby making the connection to task-relevant regions more explicit. revision: partial
Referee: Experiments section: The reported 40% and 50% throughput increases on ViT-L and ViT-H are given without error bars, number of runs, or ablation on the homogeneity threshold and patch-size set. These two free parameters directly control the accuracy–speed trade-off; without such controls the no-performance-loss assertion remains under-specified.

Authors: We agree that statistical details and parameter ablations improve rigor. Throughput numbers were obtained by averaging five independent runs on identical hardware; standard deviations will be reported in the revised tables. We have also conducted sensitivity studies on the homogeneity threshold and the discrete patch-size set; these show that accuracy remains within 0.3 % of the baseline across the operating range used in the main experiments. The ablation results will be added to the supplementary material (or a new subsection) so that readers can directly inspect the accuracy–speed trade-off surface. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical heuristic validated on external benchmarks

full rationale

The paper introduces Adaptive Patch Transformers as an engineering method that allocates variable patch sizes according to a local homogeneity metric computed from image content. Throughput and accuracy claims are established via direct empirical measurement on standard ViT-L/H models and downstream tasks (VQA, detection, segmentation), not by any equation or procedure that defines its own outputs in terms of its inputs. No self-definitional steps, fitted-parameter predictions, or load-bearing self-citations appear in the derivation; the homogeneity rule is an independent heuristic whose correctness is tested against held-out performance metrics rather than assumed by construction. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axioms · 0 invented entities

The approach rests on the domain assumption that image regions can be meaningfully classified as homogeneous or complex using local statistics, plus a small number of hand-chosen patch sizes and a convergence threshold for the one-epoch adaptation.

free parameters (2)

patch size set
Discrete set of allowed patch sizes chosen by authors; directly controls token reduction and must be tuned per model or task.
homogeneity threshold
Cutoff value deciding when a region receives a larger patch; fitted or selected to balance speed and accuracy.

axioms (1)

domain assumption Local image statistics suffice to decide patch size without losing task-relevant information
Invoked when the method allocates larger patches to homogeneous areas; if false, accuracy claims collapse.

pith-pipeline@v0.9.0 · 5698 in / 1311 out tokens · 58561 ms · 2026-05-18T05:38:04.161213+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

We use entropy H as a measure of a patch’s compressibility... lower entropy indicates higher redundancy. A large patch with low entropy should therefore be efficiently representable by a d_embed-dimensional vector.
IndisputableMonolith/Foundation/ArithmeticFromLogic.lean embed_strictMono_of_one_lt unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

APT reduces the total number of input tokens by allocating larger patch sizes in more homogeneous areas and smaller patches in more complex ones.

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Token Warping Helps MLLMs Look from Nearby Viewpoints
cs.CV 2026-04 unverdicted novelty 7.0

Backward token warping in ViT-based MLLMs enables reliable reasoning from nearby viewpoints by preserving semantic coherence better than pixel-wise warping or fine-tuning baselines.
DC-DiT: Adaptive Compute and Elastic Inference for Visual Generation via Dynamic Chunking
cs.CV 2026-03 unverdicted novelty 7.0

DC-DiT learns dynamic chunking to allocate fewer tokens to smooth or noisy regions and more to detailed or late-stage areas, cutting inference FLOPs up to 36.8% while improving FID up to 37.8% on class-conditional Ima...
TrajTok: Learning Trajectory Tokens enables better Video Understanding
cs.CV 2026-02 unverdicted novelty 7.0

TrajTok learns adaptive trajectory tokens for videos through a unified end-to-end segmenter, improving understanding performance and efficiency over patch-based or external-pipeline tokenizers.

Reference graph

Works this paper leans on

22 extracted references · 22 canonical work pages · cited by 3 Pith papers · 7 internal anchors

[1]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461,

work page internal anchor Pith review Pith/arXiv arXiv
[2]

arXiv preprint arXiv:2305.17530 , year=

Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi. Pumer: Pruning and merging tokens for efficient vision language models.arXiv preprint arXiv:2305.17530,

work page arXiv
[3]

Cf-vit: A general coarse-to-fine method for vision transformer

Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, and Rongrong Ji. Cf-vit: A general coarse-to-fine method for vision transformer. InProceedings of the AAAI con- ference on artificial intelligence, volume 37, pp. 7042–7052, 2023a. Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. Sparsevit: Revis- iting activat...

work page 2061
[4]

The efficiency mis- nomer.arXiv preprint arXiv:2110.12894,

Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. The efficiency mis- nomer.arXiv preprint arXiv:2110.12894,

work page arXiv
[5]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010
[6]

Adaptive length image tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393,

Shivam Duggal, Phillip Isola, Antonio Torralba, and William T Freeman. Adaptive length image tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393,

work page arXiv
[7]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

URLhttps://arxiv.org/abs/ 2306.13394. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ´ar, and Ross Girshick. Masked autoencoders are scalable vision learners.arXiv:2111.06377,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

Mrt5: Dynamic token merging for efficient byte-level language models.arXiv preprint arXiv:2410.20771,

Julie Kallini, Shikhar Murty, Christopher D Manning, Christopher Potts, and R ´obert Csord ´as. Mrt5: Dynamic token merging for efficient byte-level language models.arXiv preprint arXiv:2410.20771,

work page arXiv
[10]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Eduardo Blanco and Wei Lu (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: 12 Preprint System Demonstrations, pp. 66–71, Brussels, Belgium, November

work page 2018
[11]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Association for Com- putational Linguistics. doi: 10.18653/v1/D18-2012. URLhttps://aclanthology.org/ D18-2012/. Dong Hoon Lee and Seunghoon Hong. Learning to merge tokens via decoupled embedding for efficient vision transformers.Advances in Neural Information Processing Systems, 37:54079– 54104,

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2012
[12]

arXiv preprint arXiv:2202.07800 , year=

Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, and Guodong Guo. Q-vit: Ac- curate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022c. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transfor...

work page arXiv 2014
[13]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

Neural machine translation of rare words with subword units

Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URLhttps://aclanthology.org/P16-1162/. Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,

work page doi:10.18653/v1/p16-1162
[15]

Elastictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368, 2024

Wilson Yan, Matei Zaharia, V olodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. Elas- tictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368,

work page arXiv
[16]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490,

work page internal anchor Pith review Pith/arXiv arXiv
[17]

arXiv preprint arXiv:2110.09408

Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer: High-resolution transformer for dense prediction.arXiv preprint arXiv:2110.09408,

work page arXiv
[18]

Since implementations and experiments for ViT-L and ViT-H were not provided, we extended the code to include these two model configura- tions

15 Preprint A IMPLEMENTATIONDETAILS Layer-level Merging Baselines.We used the official repositories for EViT(Liang et al., 2022a), ToMe(Bolya et al., 2022), and DTEM(Lee & Hong, 2024). Since implementations and experiments for ViT-L and ViT-H were not provided, we extended the code to include these two model configura- tions. Aside from adding the ViT-L a...

work page 2022
[19]

For the full fine-tuning experiment, we follow the exact training recipe of MAE (He et al., 2021), training VIT-B for 100 epochs and VIT-L for

was used for training and evaluation, following prior works(Havtorn et al., 2023; Ronen et al., 2023). For the full fine-tuning experiment, we follow the exact training recipe of MAE (He et al., 2021), training VIT-B for 100 epochs and VIT-L for

work page 2023
[20]

All training was done with 8 GPUs and used batch size

We use a base learning rate of 1.5e-3 and use standard augmentations, namely RandAug (Cubuk et al., 2020), Random Erasing (Zhong et al., 2020), random flipping, and cropping. All training was done with 8 GPUs and used batch size

work page 2020
[21]

Semantic Segmentation.We also utilized the official EV A-02 implementation along with its pre- trained checkpoints for semantic segmentation

Patch sizes of 128, 64, and 32 were determined based on threshold values of 0.3, 2, and 2, respectively. Semantic Segmentation.We also utilized the official EV A-02 implementation along with its pre- trained checkpoints for semantic segmentation. The ADE20K dataset(Zhou et al., 2019

work page 2019
[22]

At higher token reductions, the Laplacian and upsampling- based scorers tend to remove more information that is critical to the model, which results in slightly worse performance

Although they perform similarly, the entropy scorer works better at higher token reductions. At higher token reductions, the Laplacian and upsampling- based scorers tend to remove more information that is critical to the model, which results in slightly worse performance. However, the differences are quite small and in practice we expect all three could b...

work page 2020

[1] [1]

Token Merging: Your ViT But Faster

Daniel Bolya, Cheng-Yang Fu, Xiaoliang Dai, Peizhao Zhang, Christoph Feichtenhofer, and Judy Hoffman. Token merging: Your vit but faster.arXiv preprint arXiv:2210.09461,

work page internal anchor Pith review Pith/arXiv arXiv

[2] [2]

arXiv preprint arXiv:2305.17530 , year=

Qingqing Cao, Bhargavi Paranjape, and Hannaneh Hajishirzi. Pumer: Pruning and merging tokens for efficient vision language models.arXiv preprint arXiv:2305.17530,

work page arXiv

[3] [3]

Cf-vit: A general coarse-to-fine method for vision transformer

Mengzhao Chen, Mingbao Lin, Ke Li, Yunhang Shen, Yongjian Wu, Fei Chao, and Rongrong Ji. Cf-vit: A general coarse-to-fine method for vision transformer. InProceedings of the AAAI con- ference on artificial intelligence, volume 37, pp. 7042–7052, 2023a. Xuanyao Chen, Zhijian Liu, Haotian Tang, Li Yi, Hang Zhao, and Song Han. Sparsevit: Revis- iting activat...

work page 2061

[4] [4]

The efficiency mis- nomer.arXiv preprint arXiv:2110.12894,

Mostafa Dehghani, Anurag Arnab, Lucas Beyer, Ashish Vaswani, and Yi Tay. The efficiency mis- nomer.arXiv preprint arXiv:2110.12894,

work page arXiv

[5] [5]

An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale

Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale.arXiv preprint arXiv:2010.11929,

work page internal anchor Pith review Pith/arXiv arXiv 2010

[6] [6]

Adaptive length image tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393,

Shivam Duggal, Phillip Isola, Antonio Torralba, and William T Freeman. Adaptive length image tokenization via recurrent allocation.arXiv preprint arXiv:2411.02393,

work page arXiv

[7] [7]

MME: A Comprehensive Evaluation Benchmark for Multimodal Large Language Models

URLhttps://arxiv.org/abs/ 2306.13394. Yash Goyal, Tejas Khot, Douglas Summers-Stay, Dhruv Batra, and Devi Parikh. Making the v in vqa matter: Elevating the role of image understanding in visual question answering. InProceedings of the IEEE conference on computer vision and pattern recognition, pp. 6904–6913,

work page internal anchor Pith review Pith/arXiv arXiv

[8] [8]

Masked Autoencoders Are Scalable Vision Learners

Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Doll ´ar, and Ross Girshick. Masked autoencoders are scalable vision learners.arXiv:2111.06377,

work page internal anchor Pith review Pith/arXiv arXiv

[9] [9]

Mrt5: Dynamic token merging for efficient byte-level language models.arXiv preprint arXiv:2410.20771,

Julie Kallini, Shikhar Murty, Christopher D Manning, Christopher Potts, and R ´obert Csord ´as. Mrt5: Dynamic token merging for efficient byte-level language models.arXiv preprint arXiv:2410.20771,

work page arXiv

[10] [10]

SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Taku Kudo and John Richardson. SentencePiece: A simple and language independent subword tokenizer and detokenizer for neural text processing. In Eduardo Blanco and Wei Lu (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing: 12 Preprint System Demonstrations, pp. 66–71, Brussels, Belgium, November

work page 2018

[11] [11]

Sentencepiece: A simple and language independent subword tokenizer and detokenizer for neural text processing

Association for Com- putational Linguistics. doi: 10.18653/v1/D18-2012. URLhttps://aclanthology.org/ D18-2012/. Dong Hoon Lee and Seunghoon Hong. Learning to merge tokens via decoupled embedding for efficient vision transformers.Advances in Neural Information Processing Systems, 37:54079– 54104,

work page internal anchor Pith review doi:10.18653/v1/d18-2012 2012

[12] [12]

arXiv preprint arXiv:2202.07800 , year=

Yanjing Li, Sheng Xu, Baochang Zhang, Xianbin Cao, Peng Gao, and Guodong Guo. Q-vit: Ac- curate and fully quantized low-bit vision transformer.Advances in neural information processing systems, 35:34451–34463, 2022c. Youwei Liang, Chongjian Ge, Zhan Tong, Yibing Song, Jue Wang, and Pengtao Xie. Not all patches are what you need: Expediting vision transfor...

work page arXiv 2014

[13] [13]

Scalable Diffusion Models with Transformers

William Peebles and Saining Xie. Scalable diffusion models with transformers.arXiv preprint arXiv:2212.09748,

work page internal anchor Pith review Pith/arXiv arXiv

[14] [14]

Neural machine translation of rare words with subword units

Association for Computational Linguistics. doi: 10.18653/v1/P16-1162. URLhttps://aclanthology.org/P16-1162/. Yuzhang Shang, Mu Cai, Bingxin Xu, Yong Jae Lee, and Yan Yan. Llava-prumerge: Adaptive token reduction for efficient large multimodal models.arXiv preprint arXiv:2403.15388,

work page doi:10.18653/v1/p16-1162

[15] [15]

Elastictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368, 2024

Wilson Yan, Matei Zaharia, V olodymyr Mnih, Pieter Abbeel, Aleksandra Faust, and Hao Liu. Elas- tictok: Adaptive tokenization for image and video.arXiv preprint arXiv:2410.08368,

work page arXiv

[16] [16]

MM-Vet: Evaluating Large Multimodal Models for Integrated Capabilities

Weihao Yu, Zhengyuan Yang, Linjie Li, Jianfeng Wang, Kevin Lin, Zicheng Liu, Xinchao Wang, and Lijuan Wang. Mm-vet: Evaluating large multimodal models for integrated capabilities.arXiv preprint arXiv:2308.02490,

work page internal anchor Pith review Pith/arXiv arXiv

[17] [17]

arXiv preprint arXiv:2110.09408

Yuhui Yuan, Rao Fu, Lang Huang, Weihong Lin, Chao Zhang, Xilin Chen, and Jingdong Wang. Hrformer: High-resolution transformer for dense prediction.arXiv preprint arXiv:2110.09408,

work page arXiv

[18] [18]

Since implementations and experiments for ViT-L and ViT-H were not provided, we extended the code to include these two model configura- tions

15 Preprint A IMPLEMENTATIONDETAILS Layer-level Merging Baselines.We used the official repositories for EViT(Liang et al., 2022a), ToMe(Bolya et al., 2022), and DTEM(Lee & Hong, 2024). Since implementations and experiments for ViT-L and ViT-H were not provided, we extended the code to include these two model configura- tions. Aside from adding the ViT-L a...

work page 2022

[19] [19]

For the full fine-tuning experiment, we follow the exact training recipe of MAE (He et al., 2021), training VIT-B for 100 epochs and VIT-L for

was used for training and evaluation, following prior works(Havtorn et al., 2023; Ronen et al., 2023). For the full fine-tuning experiment, we follow the exact training recipe of MAE (He et al., 2021), training VIT-B for 100 epochs and VIT-L for

work page 2023

[20] [20]

All training was done with 8 GPUs and used batch size

We use a base learning rate of 1.5e-3 and use standard augmentations, namely RandAug (Cubuk et al., 2020), Random Erasing (Zhong et al., 2020), random flipping, and cropping. All training was done with 8 GPUs and used batch size

work page 2020

[21] [21]

Semantic Segmentation.We also utilized the official EV A-02 implementation along with its pre- trained checkpoints for semantic segmentation

Patch sizes of 128, 64, and 32 were determined based on threshold values of 0.3, 2, and 2, respectively. Semantic Segmentation.We also utilized the official EV A-02 implementation along with its pre- trained checkpoints for semantic segmentation. The ADE20K dataset(Zhou et al., 2019

work page 2019

[22] [22]

At higher token reductions, the Laplacian and upsampling- based scorers tend to remove more information that is critical to the model, which results in slightly worse performance

Although they perform similarly, the entropy scorer works better at higher token reductions. At higher token reductions, the Laplacian and upsampling- based scorers tend to remove more information that is critical to the model, which results in slightly worse performance. However, the differences are quite small and in practice we expect all three could b...

work page 2020