Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
Pith reviewed 2026-05-22 06:58 UTC · model grok-4.3
The pith
Replacing selected attention heads in pretrained Vision Transformers with depthwise convolutions delivers 17-20% inference speedup with minimal performance loss.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The authors establish that some attention heads in pretrained Vision Transformers exhibit intrinsic convolution-like behavior. They introduce an efficient depthwise convolution layer as a drop-in replacement for these heads, along with simple strategies to select which heads to replace and a fine-tuning procedure that recovers downstream task performance. This substitution achieves 17-20% inference speedup across image classification and segmentation tasks with minimal performance degradation.
What carries the argument
The efficient depthwise convolution-based layer serving as a drop-in replacement for convolution-like attention heads in the ViT backbone.
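As a rough illustration of what such a layer computes, the sketch below applies a depthwise 2D convolution to a small grid of patch tokens in plain Python. The function name `depthwise_conv2d`, the 3x3 kernel size, and the zero padding are assumptions made for illustration, not details taken from the paper. Each channel gets its own input-independent kernel and channels never mix, which is what separates this operation from attention.

```python
def depthwise_conv2d(tokens, kernels):
    """Apply one k x k kernel per channel over an H x W token grid.

    tokens:  list of C channels, each an H x W list of floats.
    kernels: list of C kernels, each a k x k list of floats (k odd).
    Zero padding keeps the output the same size as the input.
    """
    out = []
    for channel, kernel in zip(tokens, kernels):
        h, w = len(channel), len(channel[0])
        r = len(kernel) // 2  # kernel radius
        result = [[0.0] * w for _ in range(h)]
        for i in range(h):
            for j in range(w):
                acc = 0.0
                for di in range(-r, r + 1):
                    for dj in range(-r, r + 1):
                        y, x = i + di, j + dj
                        if 0 <= y < h and 0 <= x < w:  # skip padded positions
                            acc += kernel[di + r][dj + r] * channel[y][x]
                result[i][j] = acc
        out.append(result)
    return out


# Hypothetical example: 2 channels on a 3x3 patch grid.
identity = [[0, 0, 0], [0, 1, 0], [0, 0, 0]]  # copies each token unchanged
box_blur = [[1 / 9] * 3 for _ in range(3)]    # averages a 3x3 neighborhood
grid = [[1.0, 2.0, 3.0], [4.0, 5.0, 6.0], [7.0, 8.0, 9.0]]
features = depthwise_conv2d([grid, grid], [identity, box_blur])
```

The cost per token depends only on the kernel size, not on the number of tokens, whereas an attention head compares every token with every other token.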
If this is right
- Inference runs 17-20% faster on image classification tasks after the replacements.
- Similar speed gains appear on segmentation tasks with only small accuracy changes.
- Fine-tuning after replacement restores most original task performance.
- Straightforward metrics identify which heads can be replaced without breaking the model.
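The summary does not say whether "17-20% speedup" means higher throughput or lower per-image latency, and the two readings give slightly different numbers. The sketch below works through both for a hypothetical 100 ms baseline; the function name `latency_after` and the baseline value are assumptions, not figures from the paper.

```python
def latency_after(baseline_ms, pct, convention):
    """Latency implied by a `pct` percent speedup under a given convention."""
    if convention == "throughput":  # pct% more images per second
        return baseline_ms / (1 + pct / 100)
    if convention == "latency":     # pct% less time per image
        return baseline_ms * (1 - pct / 100)
    raise ValueError(f"unknown convention: {convention}")


baseline_ms = 100.0  # hypothetical baseline, not a number from the paper
for pct in (17, 20):
    for convention in ("throughput", "latency"):
        print(pct, convention, round(latency_after(baseline_ms, pct, convention), 1))
```

Under the throughput reading a 20% speedup leaves about 83 ms per image; under the latency reading it leaves 80 ms, so the convention matters when comparing against other methods.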
Where Pith is reading between the lines
- The head-replacement idea could apply to other transformer families if comparable convolution-like heads appear there.
- Pairing the method with quantization might produce larger total efficiency improvements.
- Automatic detection of replaceable heads could remove the need for manual selection rules.
Load-bearing premise
Some attention heads in pretrained ViTs exhibit intrinsic convolution-like behavior that permits them to be replaced by an efficient depthwise convolution layer while preserving overall feature extraction capabilities.
What would settle it
Measure the cosine similarity between feature maps produced by the selected attention heads and by depthwise convolutions on identical inputs; low similarity for the chosen heads would indicate the replacement cannot preserve performance.
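A minimal sketch of that check, assuming the two feature maps have already been extracted as nested lists of floats; the helper names and the example values are hypothetical and do not come from the paper.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equally sized flat feature vectors."""
    if len(a) != len(b):
        raise ValueError("feature maps must have the same number of elements")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # convention chosen here for all-zero maps
    return dot / (norm_a * norm_b)


def flatten(feature_map):
    """Turn an H x W nested list into a flat list, row by row."""
    return [v for row in feature_map for v in row]


# Hypothetical outputs of one attention head and its convolution
# replacement on the same input.
head_out = [[0.9, 2.1], [2.9, 4.2]]
conv_out = [[1.0, 2.0], [3.0, 4.0]]
score = cosine_similarity(flatten(head_out), flatten(conv_out))
```

A score near 1 for the selected heads, and clearly lower scores for the heads left as attention, would be consistent with the premise; the threshold separating the two is a judgment call that this sketch does not settle.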
Original abstract
Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves 17-20% percent inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript describes a method for accelerating pretrained Vision Transformers by replacing certain attention heads with depthwise convolution layers. The approach exploits what the authors describe as intrinsic convolution-like behavior in some heads, using simple strategies to select them and a fine-tuning procedure to maintain performance. The key result is a 17-20% inference speedup on classification and segmentation tasks with minimal degradation, supported by experiments and efficiency benchmarks. The code is made publicly available.
Significance. If validated, this could be a valuable contribution to efficient inference for vision foundation models, offering a practical acceleration technique that preserves the benefits of pretraining with limited additional training. The emphasis on drop-in replacement and public implementation supports potential adoption in the field.
major comments (1)
- [Section 3.2] The description of the head selection strategies needs to explicitly demonstrate that identification is performed using only pretrained model characteristics without reference to post-substitution performance on downstream tasks. This is crucial to support the claim of intrinsic convolution-like behavior rather than an architecture search with recovery.
minor comments (1)
- [Abstract] The phrase '17-20% percent' contains a redundant 'percent' and should be revised to '17-20%'.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on clarifying the head selection process to better support our claims regarding intrinsic convolution-like behavior. Below we address the major comment point by point.
Point-by-point responses
- Referee: [Section 3.2] The description of the head selection strategies needs to explicitly demonstrate that identification is performed using only pretrained model characteristics without reference to post-substitution performance on downstream tasks. This is crucial to support the claim of intrinsic convolution-like behavior rather than an architecture search with recovery.
Authors: We agree that this distinction is important for substantiating the intrinsic nature of the observed behavior. The head selection strategies described in Section 3.2 are based exclusively on characteristics extracted from the pretrained model (e.g., analysis of attention weight distributions, token interaction patterns, and layer-wise statistics computed directly on the frozen pretrained weights and activations). No downstream task data, fine-tuning, or post-replacement accuracy measurements are used at any stage of identification. To address the request for explicit demonstration, we have revised Section 3.2 to include a new clarifying paragraph that states the selection criteria rely solely on pretrained model properties and explicitly notes the absence of any reference to post-substitution performance. We have also added a short proof-of-concept experiment in the revised section showing that the selected heads exhibit convolution-like properties when evaluated on the pretrained model alone, prior to any replacement or fine-tuning. revision: yes
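To make the distinction concrete, here is one way a selection rule could depend only on attention maps from the frozen pretrained model. This is an illustrative locality heuristic, not the criterion of Section 3.2: the functions `locality_score` and `select_heads`, the window radius, and the 0.8 threshold are all assumptions, and the class token is ignored for simplicity.

```python
def locality_score(attn, grid_w, radius=1):
    """Mean fraction of attention mass each query places on tokens within
    `radius` grid steps of itself.

    attn is an N x N row-stochastic matrix over patch tokens laid out
    row-major on a grid of width grid_w.
    """
    n = len(attn)
    total = 0.0
    for q in range(n):
        qy, qx = divmod(q, grid_w)
        for k in range(n):
            ky, kx = divmod(k, grid_w)
            if abs(qy - ky) <= radius and abs(qx - kx) <= radius:
                total += attn[q][k]
    return total / n


def select_heads(attn_per_head, grid_w, threshold=0.8):
    """Return indices of heads whose locality, averaged over the
    calibration inputs, reaches the threshold. No task labels or
    post-replacement accuracy are consulted."""
    selected = []
    for h, maps in enumerate(attn_per_head):
        mean = sum(locality_score(a, grid_w) for a in maps) / len(maps)
        if mean >= threshold:
            selected.append(h)
    return selected


# Hypothetical maps for two heads on a 1 x 4 token grid, two inputs each.
n = 4
local = [[1.0 if q == k else 0.0 for k in range(n)] for q in range(n)]
uniform = [[1.0 / n] * n for _ in range(n)]
chosen = select_heads([[local, local], [uniform, uniform]], grid_w=4)
```

In this toy setup the strictly local head is selected and the uniform, global head is not. Any rule of this shape can be audited for the referee's concern, because its inputs are visibly limited to pretrained attention statistics.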
Circularity Check
No circularity: empirical replacement technique with independent validation
full rationale
The paper describes an empirical method to identify and replace selected attention heads in pretrained ViTs with depthwise convolution layers, followed by fine-tuning to recover task performance. No equations, derivations, or self-citations are presented that reduce the claimed 17-20% speedup or performance preservation to a fitted parameter, renamed input, or load-bearing self-reference by construction. Head selection and replacement are framed as exploiting intrinsic pretrained behavior, with validation through experiments rather than a closed loop. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Some attention heads in pretrained ViTs exhibit intrinsic convolution-like behavior
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We approximate attention by assuming that some heads can be replaced by input-independent kernels restricted to a local neighborhood Δk (Eq. 8); selection uses Σh = sum σEh (Eq. 15)
- IndisputableMonolith/Foundation/RealityFromDistinction.lean, reality_from_one_distinction (tag: unclear)
  unclear: Relation between the paper passage and the cited Recognition theorem.
  Passage: We propose simple strategies to identify which heads can be replaced
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.