Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Carmelo Scribano; Danda Pani Paudel; Giorgia Franchini; Luc Van Gool; Marko Bertogna; Mohammad Mahdi; Nedyalko Prisadnikov; Yuqian Fu

arxiv: 2605.22132 · v1 · pith:IQGO64UQnew · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 06:58 UTC · model grok-4.3

classification 💻 cs.CV

keywords vision transformers, depthwise convolution, model acceleration, inference speedup, attention heads, image classification, semantic segmentation

The pith

Replacing selected attention heads in pretrained Vision Transformers with depthwise convolutions delivers 17-20% inference speedup with minimal performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to reduce the high inference costs of large Vision Transformer backbones in pretrained vision foundation models. It does this by identifying attention heads that already perform operations similar to convolutions and replacing them with lightweight depthwise convolution layers. Targeted identification strategies and a fine-tuning procedure help recover performance on downstream tasks. The result is faster execution on image classification and segmentation benchmarks without major accuracy drops. This matters because it makes strong pretrained models more usable on devices with limited compute resources.

Core claim

The authors establish that some attention heads in pretrained Vision Transformers exhibit intrinsic convolution-like behavior. They introduce an efficient depthwise convolution layer as a drop-in replacement for these heads, along with simple strategies to select which heads to replace and a fine-tuning procedure that recovers downstream task performance. This substitution achieves 17-20% inference speedup across image classification and segmentation tasks with minimal performance degradation.

What carries the argument

The efficient depthwise convolution-based layer serving as a drop-in replacement for convolution-like attention heads in the ViT backbone.
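
To make the idea of a depthwise convolution over ViT tokens concrete, here is a minimal pure-Python sketch. It is not the paper's layer: the function name `depthwise_conv_tokens`, the zero padding at the borders, and the omission of any class-token handling are assumptions made for illustration. Each channel is filtered by its own small kernel over the patch grid, and the kernel does not depend on the input, in contrast to attention weights.

```python
def depthwise_conv_tokens(tokens, kernels, grid_h, grid_w):
    """Filter each channel of a token grid with its own k x k kernel.

    tokens:  list of grid_h * grid_w rows (row-major), each a list of C values
    kernels: list of C kernels, each a k x k list of lists with k odd
    Returns a list shaped like `tokens`. Borders use zero padding (assumption).
    Uses the cross-correlation convention common in deep learning libraries.
    """
    channels = len(kernels)
    k = len(kernels[0])
    r = k // 2
    out = [[0.0] * channels for _ in range(grid_h * grid_w)]
    for y in range(grid_h):
        for x in range(grid_w):
            for c in range(channels):
                acc = 0.0
                for dy in range(-r, r + 1):
                    for dx in range(-r, r + 1):
                        yy, xx = y + dy, x + dx
                        # Skip neighbours that fall outside the grid.
                        if 0 <= yy < grid_h and 0 <= xx < grid_w:
                            weight = kernels[c][dy + r][dx + r]
                            acc += weight * tokens[yy * grid_w + xx][c]
                out[y * grid_w + x][c] = acc
    return out


# Example: a kernel that copies the right-hand neighbour of every token.
shift_left = [[0, 0, 0], [0, 0, 1], [0, 0, 0]]
row = [[1.0], [2.0], [3.0]]  # 1 x 3 grid, one channel
shifted = depthwise_conv_tokens(row, [shift_left], 1, 3)
```

In the example, `shifted` holds the right neighbour of each token and zero at the right border, which is the kind of fixed, local mixing pattern that a convolution-like attention head would approximate.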

If this is right

Inference runs 17-20% faster on image classification tasks after the replacements.
Similar speed gains appear on segmentation tasks with only small accuracy changes.
Fine-tuning after replacement restores most original task performance.
Straightforward metrics identify which heads can be replaced without breaking the model.
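
A back-of-envelope count of multiply-accumulate operations (MACs) illustrates why swapping a head's token mixing for a depthwise convolution can save compute. The numbers below are assumptions, not figures from the paper: a head dimension of 64 (the standard ViT-L width of 1024 split over 16 heads), a patch size of 14 (giving a 24 × 24 grid at 336 × 336), and a 3 × 3 kernel. The count covers only the token-mixing step and ignores projections, softmax, and memory traffic, so it does not predict the reported end-to-end 17-20% speedup.

```python
def attention_mixing_macs(num_tokens, head_dim):
    """MACs for one head's Q @ K^T plus attention @ V (projections excluded)."""
    return 2 * num_tokens * num_tokens * head_dim


def depthwise_mixing_macs(num_tokens, head_dim, kernel_size):
    """MACs for a depthwise k x k convolution over the same tokens and channels."""
    return num_tokens * head_dim * kernel_size * kernel_size


# Assumed setup: 24 x 24 patch tokens, head_dim 64, 3 x 3 kernel.
n_tokens = 24 * 24
attn_macs = attention_mixing_macs(n_tokens, 64)
conv_macs = depthwise_mixing_macs(n_tokens, 64, 3)
ratio = attn_macs // conv_macs
```

Under these assumptions the attention mixing step costs 128 times more MACs than the depthwise one for a single head. The gap grows with the number of tokens because attention scales quadratically in it while the convolution scales linearly.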

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The head-replacement idea could apply to other transformer families if comparable convolution-like heads appear there.
Pairing the method with quantization might produce larger total efficiency improvements.
Automatic detection of replaceable heads could remove the need for manual selection rules.

Load-bearing premise

Some attention heads in pretrained ViTs exhibit intrinsic convolution-like behavior that permits them to be replaced by an efficient depthwise convolution layer while preserving overall feature extraction capabilities.

What would settle it

Measure the cosine similarity between feature maps produced by the selected attention heads and by depthwise convolutions on identical inputs; low similarity for the chosen heads would indicate the replacement cannot preserve performance.
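
The check described above can be sketched in a few lines of pure Python. The helper names `cosine_similarity` and `mean_token_cosine` are hypothetical, and averaging per-token similarities is one possible aggregation chosen for illustration; the paper may measure agreement differently.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_token_cosine(head_out, conv_out):
    """Average per-token cosine similarity between two feature maps.

    Both arguments are lists of token vectors with identical shapes, for
    example the output of one attention head and of its convolutional
    replacement on the same input.
    """
    if len(head_out) != len(conv_out):
        raise ValueError("feature maps must have the same number of tokens")
    sims = [cosine_similarity(h, c) for h, c in zip(head_out, conv_out)]
    return sum(sims) / len(sims)


# Toy example: the first token matches exactly, the second is orthogonal.
score = mean_token_cosine([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [1.0, 0.0]])
```

A score near 1 for the selected heads would be consistent with the load-bearing premise, while a low score would suggest the replacement discards information the head was carrying.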

Figures

Figures reproduced from arXiv: 2605.22132 by Carmelo Scribano, Danda Pani Paudel, Giorgia Franchini, Luc Van Gool, Marko Bertogna, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu.

**Figure 1.** Illustration of the proposed drop-in approximation. We replace attention (a) with a depthwise convolution (b), which improves inference speed while reusing the pre-trained network parameters for performance.

**Figure 2.** Speedup vs. number of heads replaced in blockwise and scattered setups. Results on ViT-L (24 blocks, 16 heads per block, 336 × 336). [Plot: x-axis Block (0 to 23); legend DSP (|S| = 12), DSP (|S| = 17).]

Original abstract

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves 17-20% percent inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The paper shows a workable drop-in swap of some ViT attention heads for depthwise convolutions that delivers 17-20% inference speedup after light fine-tuning, but the head selection process needs clearer independence from the replacement step itself.

The main thing to know is that this work gives a concrete way to speed up already-trained large ViTs by replacing selected attention heads with depthwise convolution layers, then using targeted fine-tuning to get performance back. They report 17-20% inference gains on both classification and segmentation with only small accuracy drops, and they release code plus benchmarks to back it up.

Referee Report

1 major / 1 minor

Summary. The manuscript describes a method for accelerating pretrained Vision Transformers by replacing certain attention heads with depthwise convolution layers. The approach exploits what the authors describe as intrinsic convolution-like behavior in some heads, using simple strategies to select them and a fine-tuning procedure to maintain performance. The key result is a 17-20% inference speedup on classification and segmentation tasks with minimal degradation, supported by experiments and efficiency benchmarks. The code is made publicly available.

Significance. If validated, this could be a valuable contribution to efficient inference for vision foundation models, offering a practical acceleration technique that preserves the benefits of pretraining with limited additional training. The emphasis on drop-in replacement and public implementation supports potential adoption in the field.

major comments (1)

[Section 3.2] The description of the head selection strategies needs to explicitly demonstrate that identification is performed using only pretrained model characteristics without reference to post-substitution performance on downstream tasks. This is crucial to support the claim of intrinsic convolution-like behavior rather than an architecture search with recovery.

minor comments (1)

[Abstract] The phrase '17-20% percent' contains a redundant 'percent' and should be revised to '17-20%'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on clarifying the head selection process to better support our claims regarding intrinsic convolution-like behavior. Below we address the major comment point by point.

Referee: [Section 3.2] The description of the head selection strategies needs to explicitly demonstrate that identification is performed using only pretrained model characteristics without reference to post-substitution performance on downstream tasks. This is crucial to support the claim of intrinsic convolution-like behavior rather than an architecture search with recovery.

Authors: We agree that this distinction is important for substantiating the intrinsic nature of the observed behavior. The head selection strategies described in Section 3.2 are based exclusively on characteristics extracted from the pretrained model (e.g., analysis of attention weight distributions, token interaction patterns, and layer-wise statistics computed directly on the frozen pretrained weights and activations). No downstream task data, fine-tuning, or post-replacement accuracy measurements are used at any stage of identification. To address the request for explicit demonstration, we have revised Section 3.2 to include a new clarifying paragraph that states the selection criteria rely solely on pretrained model properties and explicitly notes the absence of any reference to post-substitution performance. We have also added a short proof-of-concept experiment in the revised section showing that the selected heads exhibit convolution-like properties when evaluated on the pretrained model alone, prior to any replacement or fine-tuning.

Revision: yes
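
As an illustration of what a selection statistic computed only from frozen pretrained attention maps could look like, the sketch below scores one head by its mean attention distance on the patch grid. This is not the criterion from Section 3.2 of the paper: the function name `mean_attention_distance` and the choice of Euclidean grid distance are assumptions made for the example. A low score means the head mostly attends to nearby patches, which is the kind of locality a small convolution kernel can cover.

```python
import math


def mean_attention_distance(attn, grid_h, grid_w):
    """Average spatial distance a head attends over, in units of patches.

    attn: N x N list of lists with N = grid_h * grid_w; row i holds the
          attention weights of query patch i over all key patches and is
          expected to sum to 1. Patches are indexed in row-major order.
    """
    n = grid_h * grid_w
    total = 0.0
    for i in range(n):
        yi, xi = divmod(i, grid_w)
        for j in range(n):
            yj, xj = divmod(j, grid_w)
            total += attn[i][j] * math.hypot(yi - yj, xi - xj)
    return total / n


# Toy 1 x 2 grid: a head attending only to itself versus a uniform head.
self_only = mean_attention_distance([[1.0, 0.0], [0.0, 1.0]], 1, 2)
uniform = mean_attention_distance([[0.5, 0.5], [0.5, 0.5]], 1, 2)
```

Because the score needs only attention maps from the unmodified model, computing it involves no downstream labels and no post-replacement accuracy, which is the independence property the referee asks the authors to demonstrate.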

Circularity Check

0 steps flagged

No circularity: empirical replacement technique with independent validation

The paper describes an empirical method to identify and replace selected attention heads in pretrained ViTs with depthwise convolution layers, followed by fine-tuning to recover task performance. No equations, derivations, or self-citations are presented that reduce the claimed 17-20% speedup or performance preservation to a fitted parameter, renamed input, or load-bearing self-reference by construction. Head selection and replacement are framed as exploiting intrinsic pretrained behavior, with validation through experiments rather than a closed loop. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that certain attention heads already compute something close to a depthwise convolution; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Some attention heads in pretrained ViTs exhibit intrinsic convolution-like behavior
This premise is required for the drop-in replacement to preserve feature quality.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We approximate attention by assuming that some heads can be replaced by input-independent kernels restricted to a local neighborhood Δk (Eq. 8); selection uses Σh = sum σEh (Eq. 15)

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: We propose simple strategies to identify which heads can be replaced

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

[1] [1]

In: The 2020 Conference on Empirical Methods in Natural Language Processing

Behnke, M., Heafield, K.: Losing heads in the lottery: Pruning transformer. In: The 2020 Conference on Empirical Methods in Natural Language Processing. pp. 2664–2674. Association for Computational Linguistics (ACL) (2020)

work page 2020

[2] [2]

In: International Conference on Learning Represen- tations (2023)

Bolya, D., Fu, C.Y., Dai, X., Zhang, P., Feichtenhofer, C., Hoffman, J.: Token merging: Your ViT but faster. In: International Conference on Learning Represen- tations (2023)

work page 2023

[3] [3]

In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics

Brix, C., Bahar, P., Ney, H.: Successfully applying the stabilized lottery ticket hypothesis to the transformer architecture. In: Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. pp. 3909–3915 (2020)

work page 2020

[4] [4]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Cai, H., Li, J., Hu, M., Gan, C., Han, S.: Efficientvit: Lightweight multi-scale attention for high-resolution dense prediction. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 17302–17313 (2023)

work page 2023

[5] [5]

In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)

Caron, M., Touvron, H., Misra, I., Jégou, H., Mairal, J., Bojanowski, P., Joulin, A.: Emerging properties in self-supervised vision transformers. In: Proceedings of the International Conference on Computer Vision (ICCV) (2021)

work page 2021

[6] [6]

Advances in neural infor- mation processing systems33, 15834–15846 (2020)

Chen, T., Frankle, J., Chang, S., Liu, S., Zhang, Y., Wang, Z., Carbin, M.: The lottery ticket hypothesis for pre-trained bert networks. Advances in neural infor- mation processing systems33, 15834–15846 (2020)

work page 2020

[7] [7]

In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=HJlnC1rKPB

Cordonnier, J.B., Loukas, A., Jaggi, M.: On the relationship between self-attention and convolutional layers. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=HJlnC1rKPB

work page 2020

[8] [8]

In: International Conference on Learning Representations (ICLR) (2024)

Dao, T.: FlashAttention-2: Faster attention with better parallelism and work par- titioning. In: International Conference on Learning Representations (ICLR) (2024)

work page 2024

[9] [9]

In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

Dao,T.,Fu,D.Y.,Ermon,S.,Rudra,A.,Ré,C.:FlashAttention:Fastandmemory- efficient exact attention with IO-awareness. In: Advances in Neural Information Processing Systems (NeurIPS) (2022)

work page 2022

[10] [10]

In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers)

Devlin, J., Chang, M.W., Lee, K., Toutanova, K.: Bert: Pre-training of deep bidi- rectional transformers for language understanding. In: Proceedings of the 2019 conference of the North American chapter of the association for computational linguistics: human language technologies, volume 1 (long and short papers). pp. 4171–4186 (2019)

work page 2019

[11] [11]

In: International Conference on Learning Representations (2021), https://openreview

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N.: An image is worth 16x16 words: Transformers for image recognition at scale. In: International Conference on Learning Representations (2021), https://openreview. net/forum?id=YicbFdNTTy

work page 2021

[12] [12]

Graham,B.,El-Nouby,A.,Touvron,H.,Stock,P.,Joulin,A.,Jégou,H.,Douze,M.: Levit:avisiontransformerinconvnet’sclothingforfasterinference.In:Proceedings of the IEEE/CVF international conference on computer vision. pp. 12259–12269 (2021)

work page 2021

[13] Han, Q., Fan, Z., Dai, Q., Sun, L., Cheng, M.M., Liu, J., Wang, J.: On the connection between local attention and dynamic depth-wise convolution. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=L3_SsSNMmy

[14] Hassibi, B., Stork, D.: Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems 5 (1992)

[15] He, H., Cai, J., Liu, J., Pan, Z., Zhang, J., Tao, D., Zhuang, B.: Pruning self-attentions into convolutional layers in single path. IEEE Transactions on Pattern Analysis and Machine Intelligence (2024)

[16] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16000–16009 (2022)

[17] Jang, E., Gu, S., Poole, B.: Categorical reparameterization with gumbel-softmax. arXiv preprint arXiv:1611.01144 (2016)

[18] Jelassi, S., Sander, M., Li, Y.: Vision transformers provably learn spatial structure. Advances in Neural Information Processing Systems 35, 37822–37836 (2022)

[19] Kool, W., Van Hoof, H., Welling, M.: Stochastic beams and where to find them: The Gumbel-top-k trick for sampling sequences without replacement. In: Chaudhuri, K., Salakhutdinov, R. (eds.) Proceedings of the 36th International Conference on Machine Learning. Proceedings of Machine Learning Research, vol. 97, pp. 3499–. PMLR (09–15 Jun 2019), https://proceedings.mlr.press/v97/kool19a.html

[21] Lagunas, F., Charlaix, E., Sanh, V., Rush, A.M.: Block pruning for faster transformers. arXiv preprint arXiv:2109.04838 (2021)

[22] LeCun, Y., Denker, J., Solla, S.: Optimal brain damage. Advances in neural information processing systems 2 (1989)

[23] Li, J., Cotterell, R., Sachan, M.: Differentiable subset pruning of transformer heads. Transactions of the Association for Computational Linguistics 9, 1442–1459 (2021)

[24] Lin, S., Lyu, P., Liu, D., Tang, T., Liang, X., Song, A., Chang, X.: Mlp can be a good transformer learner. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 19489–19498 (2024)

[25] Lin, T., Maire, M., Belongie, S.J., Bourdev, L.D., Girshick, R.B., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft COCO: common objects in context. CoRR abs/1405.0312 (2014), http://arxiv.org/abs/1405.0312

[26] Marin, D., Chang, J.H.R., Ranjan, A., Prabhu, A., Rastegari, M., Tuzel, O.: Token pooling in vision transformers. arXiv preprint arXiv:2110.03860 (2021)

[27] Mehta, S., Rastegari, M.: Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: International Conference on Learning Representations (2022)

[28] Meng, L., Li, H., Chen, B.C., Lan, S., Wu, Z., Jiang, Y.G., Lim, S.N.: Adavit: Adaptive vision transformers for efficient image recognition. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 12309–12318 (2022)

[29] Michel, P., Levy, O., Neubig, G.: Are sixteen heads really better than one? Advances in neural information processing systems 32 (2019)

[30] Oquab, M., Darcet, T., Moutakanni, T., Vo, H.V., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Howes, R., Huang, P.Y., Xu, H., Sharma, V., Li, S.W., Galuba, W., Rabbat, M., Assran, M., Ballas, N., Synnaeve, G., Misra, I., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P.: Dinov2: Learning robust visual features without ... (2023)

[31] Prasetyo, Y., Yudistira, N., Widodo, A.W.: Sparse then prune: Toward efficient vision transformers. Available at SSRN 4529273 (2023)

[32] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[33] Raghu, M., Unterthiner, T., Kornblith, S., Zhang, C., Dosovitskiy, A.: Do vision transformers see like convolutional neural networks? Advances in neural information processing systems 34, 12116–12128 (2021)

[34] Rao, Y., Zhao, W., Liu, B., Lu, J., Zhou, J., Hsieh, C.J.: Dynamicvit: Efficient vision transformers with dynamic token sparsification. Advances in neural information processing systems 34, 13937–13949 (2021)

[35] Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., Berg, A.C., Fei-Fei, L.: ImageNet Large Scale Visual Recognition Challenge. International Journal of Computer Vision (IJCV) 115(3), 211–252 (2015). https://doi.org/10.1007/s11263-015-0816-y

[36] Sanh, V., Wolf, T., Rush, A.: Movement pruning: Adaptive sparsity by fine-tuning. Advances in neural information processing systems 33, 20378–20389 (2020)

[37] Shen, Z., Zhang, M., Zhao, H., Yi, S., Li, H.: Efficient attention: Attention with linear complexities. In: Proceedings of the IEEE/CVF winter conference on applications of computer vision. pp. 3531–3539 (2021)

[38] Touvron, H., Cord, M., Douze, M., Massa, F., Sablayrolles, A., Jegou, H.: Training data-efficient image transformers & distillation through attention. In: International Conference on Machine Learning. vol. 139, pp. 10347–10357 (July 2021)

[39] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A.N., Kaiser, Ł., Polosukhin, I.: Attention is all you need. Advances in neural information processing systems 30 (2017)

[40] Voita, E., Talbot, D., Moiseev, F., Sennrich, R., Titov, I.: Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In: Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics. pp. 5797–5808 (2019)

[41] Wadekar, S.N., Chaurasia, A.: Mobilevitv3: Mobile-friendly vision transformer with simple and effective fusion of local, global and input features (2022)

[42] Welford, B.P.: Note on a method for calculating corrected sums of squares and products. Technometrics 4(3), 419–420 (1962)

[43] Xiong, Y., Zeng, Z., Chakraborty, R., Tan, M., Fung, G., Li, Y., Singh, V.: Nyströmformer: A nyström-based algorithm for approximating self-attention. In: Proceedings of the AAAI conference on artificial intelligence. vol. 35, pp. 14138–14148 (2021)

[44] Xu, K., Wang, Z., Chen, C., Geng, X., Lin, J., Yang, X., Wu, M., Li, X., Lin, W.: Lpvit: Low-power semi-structured pruning for vision transformers. In: European Conference on Computer Vision. pp. 269–287. Springer (2024)

[45] Yang, H., Yin, H., Shen, M., Molchanov, P., Li, H., Kautz, J.: Global vision transformer pruning with hessian-aware saliency. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18547–18557 (2023)

[46] Yao, Z., Wang, J., Wu, H., Wang, J., Long, M.: Mobile attention: mobile-friendly linear-attention for vision transformers. In: Forty-first International Conference on Machine Learning (2024)

[47] Yin, H., Vahdat, A., Alvarez, J.M., Mallya, A., Kautz, J., Molchanov, P.: A-vit: Adaptive tokens for efficient vision transformer. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 10809–10818 (2022)

[48] Yu, S., Chen, T., Shen, J., Yuan, H., Tan, J., Yang, S., Liu, J., Wang, Z.: Unified visual transformer compression. In: International Conference on Learning Representations (2022), https://openreview.net/forum?id=9jsZiUgkCZP

[49] Zheng, C., Zhang, K., Yang, Z., Tan, W., Xiao, J., Ren, Y., Pu, S., et al.: Savit: Structure-aware vision transformer pruning via collaborative optimization. Advances in Neural Information Processing Systems 35, 9010–9023 (2022)

[50] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)