
arxiv: 2605.22132 · v1 · pith:IQGO64UQnew · submitted 2026-05-21 · 💻 cs.CV

Accelerating Vision Foundation Models with Drop-in Depthwise Convolution

Pith reviewed 2026-05-22 06:58 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision transformers · depthwise convolution · model acceleration · inference speedup · attention heads · image classification · semantic segmentation

The pith

Replacing selected attention heads in pretrained Vision Transformers with depthwise convolutions delivers 17-20% inference speedup with minimal performance loss.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to reduce the high inference costs of large Vision Transformer backbones in pretrained vision foundation models. It does this by identifying attention heads that already perform operations similar to convolutions and replacing them with lightweight depthwise convolution layers. Targeted identification strategies and a fine-tuning procedure help recover performance on downstream tasks. The result is faster execution on image classification and segmentation benchmarks without major accuracy drops. This matters because it makes strong pretrained models more usable on devices with limited compute resources.

Core claim

The authors establish that some attention heads in pretrained Vision Transformers exhibit intrinsic convolution-like behavior. They introduce an efficient depthwise convolution layer as a drop-in replacement for these heads, along with simple strategies to select which heads to replace and a fine-tuning procedure that recovers downstream task performance. This substitution achieves 17-20% inference speedup across image classification and segmentation tasks with minimal performance degradation.
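To make the drop-in idea concrete, here is a minimal NumPy sketch of a depthwise convolution applied to a ViT token sequence. It is an illustration under stated assumptions, not the authors' layer: the function name `depthwise_conv_tokens`, the zero padding, and the exclusion of any class token are all choices made for this example.

```python
import numpy as np

def depthwise_conv_tokens(tokens, kernels, grid_hw):
    """Apply a per-channel (depthwise) k x k filter to a token sequence.

    tokens:  (N, C) float patch tokens in row-major grid order, N = H * W
    kernels: (C, k, k) one spatial filter per channel, k odd
    grid_hw: (H, W) shape of the patch grid
    Hypothetical helper for illustration; class tokens are assumed removed.
    """
    H, W = grid_hw
    N, C = tokens.shape
    assert N == H * W, "tokens must cover the full patch grid"
    k = kernels.shape[-1]
    pad = k // 2
    grid = tokens.reshape(H, W, C)
    padded = np.pad(grid, ((pad, pad), (pad, pad), (0, 0)))  # zero padding
    out = np.zeros_like(grid)
    for dy in range(k):
        for dx in range(k):
            # Each channel is scaled by its own weight; channels never mix.
            out += padded[dy:dy + H, dx:dx + W, :] * kernels[:, dy, dx]
    return out.reshape(N, C)

# Usage: a 3 x 4 patch grid with 4 channels and random 3 x 3 filters.
rng = np.random.default_rng(0)
x = rng.normal(size=(12, 4))
y = depthwise_conv_tokens(x, rng.normal(size=(4, 3, 3)), (3, 4))
print(y.shape)
```

The output keeps the (N, C) shape of the input, which is what allows such a layer to sit where a head's output would otherwise go. The cost grows linearly with the number of tokens, whereas attention grows quadratically.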

What carries the argument

The efficient depthwise convolution-based layer serving as a drop-in replacement for convolution-like attention heads in the ViT backbone.

If this is right

  • Inference runs 17-20% faster on image classification tasks after the replacements.
  • Similar speed gains appear on segmentation tasks with only small accuracy changes.
  • Fine-tuning after replacement restores most original task performance.
  • Straightforward metrics identify which heads can be replaced without breaking the model.
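A rough per-head operation count shows why a replaced head can be much cheaper. The sketch below counts multiply-accumulates (MACs) for the attention core of one head against a depthwise convolution over the same channels. The ViT-L figures of 16 heads and a 336 × 336 setting come from the Figure 2 caption; the 14 × 14 patch size, the hidden width of 1024, and the 3 × 3 kernel are assumptions made for this example.

```python
def attention_head_macs(num_tokens, head_dim):
    # QK^T and the attention-weighted sum over V each cost N * N * d MACs.
    return 2 * num_tokens * num_tokens * head_dim

def depthwise_conv_macs(num_tokens, head_dim, kernel_size):
    # One k x k filter per channel, evaluated at every token position.
    return num_tokens * kernel_size * kernel_size * head_dim

N = (336 // 14) ** 2   # 576 patch tokens, assuming 14 x 14 patches
d = 1024 // 16         # 64 channels per head, assuming hidden width 1024
k = 3                  # assumed kernel size

attn = attention_head_macs(N, d)
conv = depthwise_conv_macs(N, d, k)
print(attn, conv, attn / conv)  # prints: 42467328 331776 128.0
```

This count ignores the linear projections, the MLP blocks, and memory traffic, and only some heads are replaced, so a large per-head ratio is compatible with a much smaller end-to-end gain such as the reported 17-20%.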

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The head-replacement idea could apply to other transformer families if comparable convolution-like heads appear there.
  • Pairing the method with quantization might produce larger total efficiency improvements.
  • Automatic detection of replaceable heads could remove the need for manual selection rules.

Load-bearing premise

Some attention heads in pretrained ViTs exhibit intrinsic convolution-like behavior that permits them to be replaced by an efficient depthwise convolution layer while preserving overall feature extraction capabilities.
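One way to read "convolution-like" is that a head's attention weights depend only on the relative offset between query and key patches. Under that idealized assumption the head's output equals a convolution of its value vectors, which the NumPy sketch below checks numerically. Both function names are hypothetical. The construction also skips softmax normalization, so rows at the grid border do not sum to one as real attention rows would.

```python
import numpy as np

def relative_position_attention(kernel, grid_hw):
    """Build an (N, N) matrix whose weight depends only on the offset
    between query and key positions on the patch grid."""
    H, W = grid_hw
    r = kernel.shape[0] // 2
    A = np.zeros((H * W, H * W))
    for qy in range(H):
        for qx in range(W):
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    ky, kx = qy + dy, qx + dx
                    if 0 <= ky < H and 0 <= kx < W:
                        A[qy * W + qx, ky * W + kx] = kernel[dy + r, dx + r]
    return A

def conv_same_kernel(values, kernel, grid_hw):
    """Zero-padded convolution applying one k x k kernel to every channel."""
    H, W = grid_hw
    r = kernel.shape[0] // 2
    grid = values.reshape(H, W, -1)
    padded = np.pad(grid, ((r, r), (r, r), (0, 0)))
    out = np.zeros_like(grid)
    for dy in range(kernel.shape[0]):
        for dx in range(kernel.shape[1]):
            out += kernel[dy, dx] * padded[dy:dy + H, dx:dx + W, :]
    return out.reshape(H * W, -1)

rng = np.random.default_rng(0)
kernel = rng.random((3, 3))
V = rng.normal(size=(20, 8))             # 4 x 5 grid, 8 channels
A = relative_position_attention(kernel, (4, 5))
print(np.allclose(A @ V, conv_same_kernel(V, kernel, (4, 5))))
```

A single head applies the same spatial weights to all of its channels, so a depthwise convolution with one filter per channel is at least as expressive as this idealized head. Real heads are input-dependent, which is why the premise is an empirical one.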

What would settle it

Measure the cosine similarity between feature maps produced by the selected attention heads and by depthwise convolutions on identical inputs; low similarity for the chosen heads would indicate the replacement cannot preserve performance.
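The check described above could be sketched as follows. The function name and the choice to average per-token cosine similarities, rather than flattening each feature map into a single vector, are assumptions of this example and not something the paper specifies.

```python
import numpy as np

def feature_cosine_similarity(head_out, conv_out, eps=1e-8):
    """Mean per-token cosine similarity between two (N, C) feature maps.

    head_out: output of the original attention head on some input
    conv_out: output of the candidate depthwise convolution on the same input
    """
    dot = np.sum(head_out * conv_out, axis=-1)
    norms = np.linalg.norm(head_out, axis=-1) * np.linalg.norm(conv_out, axis=-1)
    # eps guards against division by zero for all-zero token features.
    return float(np.mean(dot / (norms + eps)))

rng = np.random.default_rng(0)
a = rng.normal(size=(576, 64))
print(feature_cosine_similarity(a, a))                          # close to 1
print(feature_cosine_similarity(a, rng.normal(size=(576, 64))))  # close to 0
```

Values near 1 for the selected heads would support the premise; values near 0 would indicate that fine-tuning, not intrinsic similarity, is doing the work.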

Figures

Figures reproduced from arXiv: 2605.22132 by Carmelo Scribano, Danda Pani Paudel, Giorgia Franchini, Luc Van Gool, Marko Bertogna, Mohammad Mahdi, Nedyalko Prisadnikov, Yuqian Fu.

Figure 1: Illustration of the proposed drop-in approximation. We replace attention (a) with a depthwise convolution (b), which improves inference speed while reusing the pre-trained network parameters for performance.
To address these challenges, in this paper, we propose an efficient, drop-in acceleration method for foundation ViTs. Building on previous research (Section 2.2), we assume that several Multi-head Sel…
Figure 2: Speedup vs. number of heads replaced in blockwise and scattered setups. Results on ViT-L (24 blocks, 16 heads per block, 336 × 336). [Plot not reproduced; recoverable labels: x-axis "Block" (0 to 23), legend entries DSP (|S| = 12) and DSP (|S| = 17).]
Abstract

Pretrained vision foundation models deliver strong performance across tasks with limited fine-tuning. However, their Vision Transformer (ViT) backbones impose high inference costs, limiting deployment on resource-constrained devices. In this work, we accelerate large-scale pretrained ViTs while preserving their feature extraction capabilities by exploiting the intrinsic convolution-like behavior of some attention heads. Specifically, we introduce an efficient depthwise convolution-based layer that serves as a drop-in replacement for these heads. Additionally, we propose simple strategies to identify which heads can be replaced and introduce a fine-tuning procedure that recovers downstream task performance. Across both image classification and segmentation tasks, our method achieves 17-20% percent inference speedup with minimal performance degradation. We validate the approach through detailed derivations, extensive experiments, and efficiency benchmarks. The reference implementation is publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript describes a method for accelerating pretrained Vision Transformers by replacing certain attention heads with depthwise convolution layers. The approach exploits what the authors describe as intrinsic convolution-like behavior in some heads, using simple strategies to select them and a fine-tuning procedure to maintain performance. The key result is a 17-20% inference speedup on classification and segmentation tasks with minimal degradation, supported by experiments and efficiency benchmarks. The code is made publicly available.

Significance. If validated, this could be a valuable contribution to efficient inference for vision foundation models, offering a practical acceleration technique that preserves the benefits of pretraining with limited additional training. The emphasis on drop-in replacement and public implementation supports potential adoption in the field.

major comments (1)
  1. [Section 3.2] The description of the head selection strategies needs to explicitly demonstrate that identification is performed using only pretrained model characteristics without reference to post-substitution performance on downstream tasks. This is crucial to support the claim of intrinsic convolution-like behavior rather than an architecture search with recovery.
minor comments (1)
  1. [Abstract] The phrase '17-20% percent' contains a redundant 'percent' and should be revised to '17-20%'.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We appreciate the emphasis on clarifying the head selection process to better support our claims regarding intrinsic convolution-like behavior. Below we address the major comment point by point.

read point-by-point responses
  1. Referee: [Section 3.2] The description of the head selection strategies needs to explicitly demonstrate that identification is performed using only pretrained model characteristics without reference to post-substitution performance on downstream tasks. This is crucial to support the claim of intrinsic convolution-like behavior rather than an architecture search with recovery.

    Authors: We agree that this distinction is important for substantiating the intrinsic nature of the observed behavior. The head selection strategies described in Section 3.2 are based exclusively on characteristics extracted from the pretrained model (e.g., analysis of attention weight distributions, token interaction patterns, and layer-wise statistics computed directly on the frozen pretrained weights and activations). No downstream task data, fine-tuning, or post-replacement accuracy measurements are used at any stage of identification. To address the request for explicit demonstration, we have revised Section 3.2 to include a new clarifying paragraph that states the selection criteria rely solely on pretrained model properties and explicitly notes the absence of any reference to post-substitution performance. We have also added a short proof-of-concept experiment in the revised section showing that the selected heads exhibit convolution-like properties when evaluated on the pretrained model alone, prior to any replacement or fine-tuning. revision: yes
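A selection score that uses only the pretrained model's attention maps might look like the sketch below. It is a hypothetical metric written for illustration: the rebuttal mentions attention weight distributions but does not give a formula, and the name `locality_score` and the Chebyshev-distance window are assumptions.

```python
import numpy as np

def locality_score(attn, grid_hw, radius=1):
    """Average fraction of each query's attention mass that falls on keys
    within `radius` grid steps (Chebyshev distance) of the query.

    attn: (N, N) non-negative attention map of one head over patch tokens,
          taken from the frozen pretrained model (no downstream labels).
    """
    H, W = grid_hw
    ys, xs = np.divmod(np.arange(H * W), W)   # row and column of each token
    dist = np.maximum(np.abs(ys[:, None] - ys[None, :]),
                      np.abs(xs[:, None] - xs[None, :]))
    local_mass = np.sum(attn * (dist <= radius), axis=-1)
    return float(np.mean(local_mass / attn.sum(axis=-1)))

uniform = np.full((16, 16), 1.0 / 16)         # head that attends everywhere
print(locality_score(uniform, (4, 4)))        # prints: 0.390625
print(locality_score(np.eye(16), (4, 4)))     # prints: 1.0
```

Heads could then be ranked by this score, averaged over a batch of unlabeled images, with the highest-scoring heads treated as replacement candidates. Because nothing in the score depends on post-replacement accuracy, it is the kind of criterion the referee asks the authors to demonstrate.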

Circularity Check

0 steps flagged

No circularity: empirical replacement technique with independent validation

full rationale

The paper describes an empirical method to identify and replace selected attention heads in pretrained ViTs with depthwise convolution layers, followed by fine-tuning to recover task performance. No equations, derivations, or self-citations are presented that reduce the claimed 17-20% speedup or performance preservation to a fitted parameter, renamed input, or load-bearing self-reference by construction. Head selection and replacement are framed as exploiting intrinsic pretrained behavior, with validation through experiments rather than a closed loop. The work is self-contained against external benchmarks and does not exhibit any of the enumerated circularity patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that certain attention heads already compute something close to a depthwise convolution; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption: Some attention heads in pretrained ViTs exhibit intrinsic convolution-like behavior
    This premise is required for the drop-in replacement to preserve feature quality.


