MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Mohammad Rastegari; Sachin Mehta

arxiv: 2110.02178 · v2 · pith:ELRZAFKNnew · submitted 2021-10-05 · 💻 cs.CV · cs.AI · cs.LG


Pith reviewed 2026-05-20 20:40 UTC · model grok-4.3

classification 💻 cs.CV · cs.AI · cs.LG

keywords MobileViT · vision transformer · lightweight CNN · hybrid architecture · mobile vision · image classification · object detection


The pith

MobileViT fuses local convolutions with global self-attention to build a lightweight vision transformer that is more accurate than CNN- and ViT-based networks of similar parameter count.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper asks if the spatial inductive biases and parameter efficiency of CNNs can be combined with the global representation power of vision transformers to create models suitable for mobile hardware. It introduces MobileViT as a hybrid that processes information globally by treating transformers as convolutions. This produces a network with roughly 6 million parameters that reaches 78.4 percent top-1 accuracy on ImageNet-1k. A sympathetic reader would care because current mobile models trade off accuracy for speed, and a better balance could improve on-device vision applications such as detection without extra hardware cost.

Core claim

MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. By fusing local convolutional processing with global transformer blocks, the resulting light-weight network achieves 78.4 percent top-1 accuracy on ImageNet-1k with about 6 million parameters, which is 3.2 percent and 6.2 percent more accurate than MobileNetv3 and DeIT for similar parameter counts, and delivers 5.7 percent higher accuracy than MobileNetv3 on MS-COCO object detection.
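As a quick arithmetic check, the accuracy gaps in the core claim imply approximate baseline scores. The sketch below assumes the stated gaps are absolute percentage-point differences on the same ImageNet-1k top-1 metric, which the abstract does not spell out; the function and variable names are illustrative.

```python
# Implied baseline top-1 accuracies, assuming the reported gaps are
# absolute percentage-point differences (an assumption of this sketch).
MOBILEVIT_TOP1 = 78.4  # percent, as stated in the abstract

# Reported gaps in percentage points, as stated in the abstract.
GAPS = {"MobileNetv3": 3.2, "DeIT": 6.2}


def implied_baselines(reference: float, gaps: dict[str, float]) -> dict[str, float]:
    """Subtract each reported gap from the reference accuracy."""
    return {name: round(reference - gap, 1) for name, gap in gaps.items()}


if __name__ == "__main__":
    for name, acc in implied_baselines(MOBILEVIT_TOP1, GAPS).items():
        print(f"{name}: about {acc}% top-1 (implied)")
```

These implied figures are only as reliable as the assumption above; the paper's own tables remain the authoritative source for baseline numbers.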

What carries the argument

Transformers as convolutions: the mechanism that integrates global self-attention into a convolution-style local processing pipeline while aiming to retain mobile efficiency.
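One way to picture this mechanism is an unfold, mix, fold sequence: the feature map is split into non-overlapping patches, values at the same position inside each patch are mixed across all patches, and the result is folded back to the original layout. The sketch below is a toy version in plain Python on a single-channel grid. The function names are hypothetical, and the mixing step is a plain average standing in for the transformer, so this is not the authors' implementation.

```python
def unfold(grid, h, w):
    """Rearrange an H x W grid into P = h*w rows of N entries.

    Row p holds the value at intra-patch position p from each of the
    N non-overlapping h x w patches.
    """
    H, W = len(grid), len(grid[0])
    if H % h or W % w:
        raise ValueError("H and W must be multiples of the patch size")
    rows = []
    for py in range(h):
        for px in range(w):
            rows.append([grid[y + py][x + px]
                         for y in range(0, H, h)
                         for x in range(0, W, w)])
    return rows


def fold(rows, H, W, h, w):
    """Inverse of unfold: place every value back at its original pixel."""
    grid = [[None] * W for _ in range(H)]
    p = 0
    for py in range(h):
        for px in range(w):
            n = 0
            for y in range(0, H, h):
                for x in range(0, W, w):
                    grid[y + py][x + px] = rows[p][n]
                    n += 1
            p += 1
    return grid


def mix_across_patches(rows):
    """Stand-in for the transformer: equal-weight averaging across patches."""
    return [[sum(r) / len(r)] * len(r) for r in rows]


if __name__ == "__main__":
    grid = [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9, 10, 11], [12, 13, 14, 15]]
    rows = unfold(grid, 2, 2)
    print(fold(mix_across_patches(rows), 4, 4, 2, 2))
```

Because every row spans all patches, each output pixel depends on pixels from the whole grid, which is the sense in which the block's receptive field can cover H×W.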

If this is right

MobileViT can serve as a drop-in backbone for mobile classification and detection pipelines with improved accuracy at similar size.
The same hybrid pattern can be applied to other vision tasks while keeping parameter counts low.
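Global context becomes accessible in mobile networks at a parameter count comparable to light-weight CNNs. Self-attention still scales quadratically with the number of patches, so the practical saving depends on applying it at small spatial resolutions.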
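To make the cost point concrete, the sketch below counts pairwise attention interactions under two simplified schemes: a hypothetical baseline that attends over every pixel pair, and a patch-wise scheme that attends across N patches separately for each of the P positions inside a patch. Both function names are illustrative, and the count ignores channel width, heads, and the convolutional layers.

```python
def attention_pairs_full(H: int, W: int) -> int:
    """Pairwise interactions when every pixel attends to every pixel."""
    tokens = H * W
    return tokens * tokens


def attention_pairs_patchwise(H: int, W: int, h: int, w: int) -> int:
    """Pairwise interactions when attention runs across patches only.

    P = h*w positions per patch, N = H*W / P patches; each position
    attends across N patches, giving P * N^2 interactions.
    """
    if H % h or W % w:
        raise ValueError("H and W must be multiples of the patch size")
    P = h * w
    N = (H * W) // P
    return P * N * N


if __name__ == "__main__":
    full = attention_pairs_full(32, 32)
    patchwise = attention_pairs_patchwise(32, 32, 2, 2)
    print(full, patchwise, full // patchwise)
```

Under these assumptions the patch-wise count is smaller by a factor of P, but it remains quadratic in N, so the saving comes from keeping N small rather than from removing the quadratic term.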

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same fusion strategy could be tested on segmentation or pose estimation to check whether the accuracy lift generalizes beyond classification and detection.
Hardware-specific optimizations of the convolution-transformer blocks might further reduce latency on particular mobile chips.
If the gains persist across more datasets, pure CNNs may no longer be the default starting point for new mobile vision models.

Load-bearing premise

The fusion of local convolutional processing with global transformer blocks produces the observed accuracy gains without hidden costs in latency or training stability on mobile hardware.

What would settle it

Standard mobile-device latency benchmarks showing that MobileViT runs slower than MobileNetv3 at the same parameter count while failing to match the reported accuracy difference.
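A latency comparison of this kind usually needs warm-up runs and a robust summary statistic. The sketch below is a minimal, framework-agnostic timing harness; `measure_latency_ms` and `dummy_model` are hypothetical names, and the dummy workload stands in for a real forward pass, which would have to run on the target device for the numbers to mean anything.

```python
import statistics
import time


def measure_latency_ms(fn, warmup: int = 5, runs: int = 20) -> float:
    """Return the median wall-clock time of fn() in milliseconds."""
    # Warm-up calls are discarded so caches and lazy initialisation
    # do not distort the timed runs.
    for _ in range(warmup):
        fn()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        samples.append((time.perf_counter() - start) * 1000.0)
    return statistics.median(samples)


def dummy_model():
    """Hypothetical stand-in workload; replace with a real forward pass."""
    return sum(i * i for i in range(10_000))


if __name__ == "__main__":
    print(f"median latency: {measure_latency_ms(dummy_model):.3f} ms")
```

Running the same harness on both networks, with the same batch size and the same device, would give the like-for-like comparison described above.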

Original abstract

Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. Our source code is open-source and available at: https://github.com/apple/ml-cvnets

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MobileViT, a hybrid CNN-Transformer architecture for mobile vision tasks. It proposes treating transformer blocks as convolutions to enable global information processing while retaining the spatial inductive biases and efficiency of CNNs. Empirical results claim that MobileViT achieves 78.4% top-1 accuracy on ImageNet-1k with ~6M parameters (3.2% better than MobileNetv3 and 6.2% better than DeIT at similar parameter counts) and a 5.7% accuracy improvement over MobileNetv3 on MS-COCO object detection.

Significance. If the accuracy gains are shown to come with genuinely low mobile latency and without hidden training or inference costs, the work would be significant for bridging CNN and ViT paradigms in resource-constrained settings. The open-source code release is a positive factor for reproducibility.

major comments (2)

[§4 (Experiments), Table 1 and Table 2] The central claim that the CNN-Transformer fusion produces a 'light-weight and low latency' network rests on parameter count and accuracy alone. No on-device latency, wall-clock inference time, or mobile-specific FLOPs measurements are provided to verify that the self-attention component does not introduce hidden runtime costs on target hardware, which directly undermines the mobile-friendly premise.
[§3.2 (MobileViT Block)] The description of 'transformers as convolutions' is central to the novelty, yet the manuscript lacks a dedicated ablation isolating the fusion ratios and block dimensions from standard ViT or CNN baselines. Without this, it is unclear whether the reported gains are due to the proposed design or to other factors such as training recipe or capacity.

minor comments (2)

The abstract and introduction would be clearer if they explicitly stated the measured latency or speed-up factors on mobile devices rather than relying solely on parameter counts.
[§3] Notation for the fusion operation in the MobileViT block could be made more precise with an equation reference to aid reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thoughtful review and constructive feedback on our manuscript. We address the major comments point-by-point below. Where appropriate, we will revise the manuscript to incorporate additional experiments and clarifications that strengthen the presentation of MobileViT's efficiency and design contributions.


Referee: [§4 (Experiments), Table 1 and Table 2] The central claim that the CNN-Transformer fusion produces a 'light-weight and low latency' network rests on parameter count and accuracy alone. No on-device latency, wall-clock inference time, or mobile-specific FLOPs measurements are provided to verify that the self-attention component does not introduce hidden runtime costs on target hardware, which directly undermines the mobile-friendly premise.

Authors: We agree that explicit latency measurements would provide stronger support for the mobile-friendly claim. Parameter count and accuracy comparisons at iso-parameter budgets are standard in the literature for mobile models, and MobileViT reuses efficient convolutional operations for local processing while limiting self-attention to small spatial resolutions. Nevertheless, to directly address the concern, we will add wall-clock inference times and on-device latency results (measured on an iPhone 12) in the revised Section 4, along with a comparison of mobile-specific FLOPs where applicable. These additions will confirm that the hybrid design does not incur hidden runtime costs relative to MobileNetv3.
Revision: yes

Referee: [§3.2 (MobileViT Block)] The description of 'transformers as convolutions' is central to the novelty, yet the manuscript lacks a dedicated ablation isolating the fusion ratios and block dimensions from standard ViT or CNN baselines. Without this, it is unclear whether the reported gains are due to the proposed design or to other factors such as training recipe or capacity.

Authors: We thank the referee for highlighting the need for clearer isolation of the design choice. The manuscript already reports results against strong CNN (MobileNetv3) and ViT (DeiT) baselines at matched parameter counts, which controls for capacity and training recipe differences to a large extent. To further isolate the effect of treating transformers as convolutions and the specific fusion ratios, we will add a dedicated ablation study (new Table or subsection in §3.2) that varies the number and placement of MobileViT blocks while keeping total parameters and training settings fixed, directly comparing against pure CNN and pure ViT configurations of equivalent capacity.
Revision: yes

Circularity Check

0 steps flagged

Empirical performance claims rest on direct measurements against external baselines with no internal reduction


The paper's central results consist of measured top-1 accuracy (78.4% on ImageNet-1k) and detection accuracy (on MS-COCO) for the proposed MobileViT architecture, compared directly to independent external models (MobileNetv3, DeIT). These quantities are obtained by training and evaluating the network on standard benchmarks rather than being derived from any fitted internal parameters, self-citations, or ansatz that would make the output equivalent to the input by construction. The architectural description (transformers as convolutions) is presented as a design choice whose value is then validated empirically; no load-bearing step in the reported chain reduces to a tautology or to a prior result authored by the same team. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 0 axioms · 0 invented entities

The central claim rests on the empirical performance of a hand-designed hybrid architecture whose internal dimensions, block counts, and fusion strategy are chosen by the authors; these choices function as free parameters that are not derived from first principles.

free parameters (1)

MobileViT block dimensions and fusion ratios
Number of transformer layers per block, channel widths, and how local and global features are combined are architectural hyperparameters tuned to achieve the reported accuracy.
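To show how one of these architectural choices moves the parameter budget, the sketch below counts the parameters of a single transformer encoder layer as a function of its width d and its feed-forward multiplier. The appendix excerpt in the reference graph states that MobileViT uses 2d rather than the default 4d for the first feed-forward layer. The layer layout assumed here (four d×d attention projections with biases, a two-layer feed-forward network, two layer norms) is a standard encoder layer, not a confirmed detail of the authors' code, and d = 144 is an illustrative value.

```python
def transformer_layer_params(d: int, ffn_mult: int) -> int:
    """Parameter count of one encoder layer under the stated assumptions."""
    # Query, key, value, and output projections, each d x d plus bias.
    attention = 4 * (d * d + d)
    # Feed-forward network: d -> ffn_mult*d -> d, with biases.
    hidden = ffn_mult * d
    feed_forward = (d * hidden + hidden) + (hidden * d + d)
    # Two layer norms, each with a scale and a shift vector.
    norms = 2 * (2 * d)
    return attention + feed_forward + norms


if __name__ == "__main__":
    d = 144  # illustrative width, not taken from the paper's tables
    narrow = transformer_layer_params(d, 2)
    default = transformer_layer_params(d, 4)
    print(f"ffn 2d: {narrow} params, ffn 4d: {default} params")
    print(f"saving: {1 - narrow / default:.1%}")
```

Under these assumptions, halving the feed-forward width removes roughly a third of each layer's parameters, which illustrates why such choices act as tunable knobs rather than derived quantities.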

pith-pipeline@v0.9.0 · 5809 in / 1222 out tokens · 34020 ms · 2026-05-20T20:40:48.410192+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith.Cost.FunctionalEquation washburn_uniqueness_aczel
Tag (relation between the paper passage and the cited Recognition theorem): unclear
Paper passage: MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. ... achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based)

IndisputableMonolith.Foundation.DimensionForcing alexander_duality_circle_linking
Tag (relation between the paper passage and the cited Recognition theorem): unclear
Paper passage: MobileViT block replaces local processing in convolutions with global processing using transformers. ... effective receptive field of H×W

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 22 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
cs.CR 2026-05 unverdicted novelty 8.0
VIPER exposes Functional Fusion in dynamic prompt architectures, enabling a backdoor that resists pruning by tightly integrating attack and utility parameters in the same high-magnitude core.

RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
eess.IV 2025-07 unverdicted novelty 8.0
Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
cs.LG 2026-05 unverdicted novelty 7.0
QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
cs.CV 2026-05 unverdicted novelty 7.0
TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis
cs.CV 2026-05 unverdicted novelty 7.0
RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction
cs.CV 2026-04 unverdicted novelty 7.0
ESIA casts pedestrian intention prediction as CRF structured prediction on a spatiotemporal graph, combining unary individual potentials, pairwise interaction potentials, and structural consistency penalties into a gl...

KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
cs.CV 2026-04 unverdicted novelty 7.0
KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

Cross-Stage Attention Propagation for Efficient Semantic Segmentation
cs.CV 2026-04 unverdicted novelty 7.0
CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Ti...

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
cs.CV 2026-05 unverdicted novelty 6.0
SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects
cs.CV 2026-05 unverdicted novelty 6.0
OneViewAll achieves 92.5% ADD-0.1 accuracy on LINEMOD for novel object 6D pose estimation using only one real reference view by integrating category, symmetry, and patch-level semantic priors in a projection-equivaria...

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks
cs.LG 2026-04 conditional novelty 6.0
UCB-V and UCB-Tuned dominate accuracy-energy and accuracy-latency trade-offs while all tested UCB strategies achieve sub-linear regret in adaptive DNN early-exit experiments on CIFAR datasets.

LLM as a Tool, Not an Agent: Code-Mined Tree Transformations for Neural Architecture Search
cs.LG 2026-04 unverdicted novelty 6.0
LLMasTool improves neural architecture search by evolving code-mined hierarchical trees with diversity-guided Bayesian planning and targeted LLM assistance, reporting gains of 0.69, 1.83, and 2.68 points on CIFAR-10, ...

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
cs.CV 2026-04 unverdicted novelty 6.0
Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.

TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
eess.IV 2025-10 unverdicted novelty 6.0
TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark ...

Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera
cs.CV 2026-04 unverdicted novelty 5.0
A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.

CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model
cs.CV 2026-04 unverdicted novelty 5.0
Hybrid CNN-ViT with adaptive attention gate achieves 97.6% accuracy on brain tumor MRI classification, outperforming baselines.

EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation
eess.IV 2026-01 conditional novelty 5.0
EndoCaver introduces a unidirectional-guided dual-decoder transformer with GAM, DSA, and LoCoS modules for joint deblurring-segmentation, reporting 0.922 Dice on clean Kvasir-SEG data and 0.889 under degradation with ...

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
cs.CV 2023-06 conditional novelty 5.0
MobileSAM is a 60x smaller distilled version of SAM that matches original performance and runs 5x faster than concurrent FastSAM while supporting CPU inference.

CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification
cs.CV 2026-05 unverdicted novelty 4.0
CADS is a conformal-prediction-driven model cascade that routes images to scout or oracle models based on estimated complexity to reduce inference cost while preserving accuracy.

Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
cs.CV 2026-05 unverdicted novelty 4.0
A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.

Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation
cs.CV 2026-03 unverdicted novelty 4.0
KL-regularised Group DRO improves F1 scores for multi-site COVID-19 CT classification and gender-fair four-class lung pathology recognition over prior challenge baselines.

Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification
cs.CV 2025-07 unverdicted novelty 2.0
Hyperparameter tuning on seven lightweight models trained on a 90k-image ImageNet subset yields 1.5-3.5% top-1 accuracy gains, with RepVGG-A2 and MobileNetV3-L achieving sub-5ms latency and over 9800 FPS on GPU.

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · cited by 22 Pith papers · 6 internal anchors

[1]

Rethinking Atrous Convolution for Semantic Image Segmentation

Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. CrossVit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021a. Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXi...

work page internal anchor Pith review Pith/arXiv arXiv
[2]

Mobile-former: Bridging mobilenet and transformer

Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. arXiv preprint arXiv:2108.05895, 2021b. François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258,

work page arXiv
[3]

Autoaugment: Learning augmentation strategies from data

[Online; accessed 2-September-2021]. Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123,

work page 2021
[4]

Coatnet: Marrying convolution and attention for all data sizes

Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803,

work page arXiv
[5]

Convit: Improving vision transformers with soft convolutional inductive biases

Stéphane d’Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697,

work page arXiv
[6]

Levit: a vision transformer in convnet’s clothing for faster inference

Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet’s clothing for faster inference. arXiv preprint arXiv:2104.01136,

work page arXiv
[7]

Semantic contours from inverse detectors

Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998. IEEE,

work page 2011
[8]

Deep residual learning for image recognition

10 Published as a conference paper at ICLR 2022 Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778,

work page 2022
[9]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

Flattened Convolutional Neural Networks for Feedforward Acceleration

Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474,

work page internal anchor Pith review Pith/arXiv arXiv
[11]

Localvit: Bringing locality to vision transformers

Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707,

work page arXiv
[12]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation

Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 552–568,

work page 2022
[14]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. arXiv preprint arXiv:2103.13413,

work page arXiv
[15]

Dynamicvit: Efficient vision transformers with dynamic token sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. arXiv preprint arXiv:2106.02034,

work page arXiv
[16]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

work page internal anchor Pith review Pith/arXiv arXiv 1909
[17]

Training data-efficient image transformers & distillation through attention

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021a. Hugo Touvron, Matthieu Cord, Alexandre Sablayrol...

work page arXiv 2022
[18]

Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768,

work page internal anchor Pith review Pith/arXiv arXiv 2006
[19]

Cvt: Introducing convolutions to vision transformers

Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808,

work page arXiv
[20]

Early convolutions help transformers see better

Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881,

work page arXiv
[21]

Incorporating convolution designs into visual transformers

Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. arXiv preprint arXiv:2103.11816, 2021a. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on i...

work page arXiv
[22]

Multi-scale vision longformer: A new vision transformer for high-resolution image encoding

Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358,

work page arXiv
[23]

Deepvit: Towards deeper vision transformer

Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886,

work page arXiv
[24]

A MOBILEVIT ARCHITECTURE. MobileViT’s are inspired by the philosophy of light-weight CNNs and the overall architecture of MobileViT at different parameter budgets is given in Table

work page 2022
[25]

We set the output dimension of the first feed-forward layer in a transformer layer as 2d instead of 4d, a default value in the standard transformer block of Vaswani et al

The transformer layer in MobileViT takes a d-dimensional input, as shown in Figure 1b. We set the output dimension of the first feed-forward layer in a transformer layer as 2d instead of 4d, a default value in the standard transformer block of Vaswani et al. (2017). B MULTI-SCALE SAMPLER. Multi-scale sampler reduces generalization gap. Generalization capabi...

work page 2017
[26]

Here, d represents dimensionality of the input to the transformer layer in MobileViT block (Figure 1b)

Conv-1×1 1 320 384 640 Global pool 1×1 256 1 Linear 1000 1000 1000 Network Parameters 1.3 M 2.3 M 5.6 M Table 4: MobileViT architecture. Here, d represents dimensionality of the input to the transformer layer in MobileViT block (Figure 1b). By default, in MobileViT block, we set kernel size n as three and spatial dimensions of patch (height h and width w) i...

work page 2022
[27]

Impact of patch sizes

and ViT-based (Figure 7b) models, that too with basic data augmentation. Impact of patch sizes. MobileViT combines convolutions and transformers to learn local and global representations effectively. Because convolutions are applied on n × n regions and self-attention [Figure: plot with x-axis "Epochs"] ...

work page 2022
[28]

We can see that when h, w ≤ n, MobileViT can aggregate information more effectively, which helps improve performance. In our experiments, we used h = w = 2 instead of h = w = 3 because spatial dimensions of feature maps are multiples of 2, and using [Figure panels: (a) h = w = 2 < n = 3; (b) h = w = n = 3; (c) h = w = 4 >...]

work page 2022
[29]

Impact of exponential moving average and label smoothing

To avoid these extra operations, we choose h = w = 2 in our experiments, which also provides a good trade-off between latency and accuracy. Impact of exponential moving average and label smoothing. Exponential moving average (EMA) and label smoothing (LS) are two standard training methods that are used to improve CNN- and Transformer-based models performa...

work page 2018
[30]

E EXTENDED DISCUSSION: Memory footprint

than 512 × 512, then the atrous kernel weights will be applied to padded zeros; making multi-scale learning ineffective. E EXTENDED DISCUSSION. Memory footprint. A light-weight network running on mobile devices should be memory efficient. Similar to MobileNetv2, we measure the memory that needs to be materia...

work page 2022
[31]

Therefore, similar to light-weight CNNs, MobileViT networks are also memory efficient

where MobileViT blocks are employed, required memory is lesser or comparable to light-weight CNNs. Therefore, similar to light-weight CNNs, MobileViT networks are also memory efficient. FLOPs. Floating point operations (FLOPs) is another metric that is widely used to measure the efficiency of a neural network. Table 9 compare FLOPs of MobileViT with differe...

work page 2015
[32]

For instance, MobileNetv2 has 2× fewer FLOPs as compared to MobileViT on the ImageNet-1k classification task, but on the semantic segmentation, they have similar FLOPs (Table 10a vs

We can observe that (1) the gap between MobileNetv2 and MobileViT FLOPs reduces as the input resolution increases. For instance, MobileNetv2 has 2× fewer FLOPs as compared to MobileViT on the ImageNet-1k classification task, but on the semantic segmentation, they have similar FLOPs (Table 10a vs. Table 10c) and (2) MobileNetv2 models are significantly faster...

[33]

Here, † represents that the MobileViT model uses PyTorch’s Unfold and Fold operations

For GPU, inference time is measured for a batch of 32 images, while for other devices, we use a batch size of one. Here, † represents that the MobileViT model uses PyTorch’s Unfold and Fold operations. Also, patch sizes for the MobileViT model at an output stride of 8, 16, and 32 are set to two. GPU-accelerated operations for folding and unfolding as they are n...
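
To make the folding and unfolding steps concrete, the sketch below rearranges a single-channel feature map into non-overlapping patches and back. It is a pure-Python stand-in written for illustration, not PyTorch's Unfold and Fold; the function names `unfold` and `fold` and the output layout (pixel position first, patch index second) are assumptions.

```python
def unfold(fmap, h, w):
    """Split an H x W map into non-overlapping h x w patches.

    Returns a P x N nested list, where P = h * w indexes the pixel
    position inside a patch and N = (H * W) / P indexes the patch.
    """
    H, W = len(fmap), len(fmap[0])
    if H % h or W % w:
        raise ValueError("H and W must be multiples of the patch size")
    out = [[] for _ in range(h * w)]
    for top in range(0, H, h):
        for left in range(0, W, w):
            for dy in range(h):
                for dx in range(w):
                    out[dy * w + dx].append(fmap[top + dy][left + dx])
    return out


def fold(patches, H, W, h, w):
    """Inverse of unfold: rebuild the H x W map from the P x N layout."""
    fmap = [[None] * W for _ in range(H)]
    n = 0  # running patch index, in the same order unfold visited patches
    for top in range(0, H, h):
        for left in range(0, W, w):
            for dy in range(h):
                for dx in range(w):
                    fmap[top + dy][left + dx] = patches[dy * w + dx][n]
            n += 1
    return fmap
```

With a patch size of two, any feature map whose sides are multiples of 2 divides evenly. A 3 × 3 patch on a 4 × 4 map raises the `ValueError` above, which is the kind of mismatch that would otherwise need extra handling.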

[1] [1]

Rethinking Atrous Convolution for Semantic Image Segmentation

Chun-Fu Chen, Quanfu Fan, and Rameswar Panda. CrossVit: Cross-attention multi-scale vision transformer for image classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision (ICCV), 2021a. Liang-Chieh Chen, George Papandreou, Florian Schroff, and Hartwig Adam. Rethinking atrous convolution for semantic image segmentation. arXi...

[2] [2]

Mobile-former: Bridging mobilenet and transformer

Yinpeng Chen, Xiyang Dai, Dongdong Chen, Mengchen Liu, Xiaoyi Dong, Lu Yuan, and Zicheng Liu. Mobile-former: Bridging mobilenet and transformer. arXiv preprint arXiv:2108.05895, 2021b. François Chollet. Xception: Deep learning with depthwise separable convolutions. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 1251–1258,

[3] [3]

Autoaugment: Learning augmentation strategies from data

[Online; accessed 2-September-2021]. Ekin D Cubuk, Barret Zoph, Dandelion Mane, Vijay Vasudevan, and Quoc V Le. Autoaugment: Learning augmentation strategies from data. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 113–123,

[4] [4]

Coatnet: Marrying convolution and attention for all data sizes

Zihang Dai, Hanxiao Liu, Quoc V Le, and Mingxing Tan. Coatnet: Marrying convolution and attention for all data sizes. arXiv preprint arXiv:2106.04803,

[5] [5]

Convit: Improving vision transformers with soft convolutional inductive biases

Stéphane d’Ascoli, Hugo Touvron, Matthew Leavitt, Ari Morcos, Giulio Biroli, and Levent Sagun. Convit: Improving vision transformers with soft convolutional inductive biases. arXiv preprint arXiv:2103.10697,

[6] [6]

Levit: a vision transformer in convnet’s clothing for faster inference

Ben Graham, Alaaeldin El-Nouby, Hugo Touvron, Pierre Stock, Armand Joulin, Hervé Jégou, and Matthijs Douze. Levit: a vision transformer in convnet’s clothing for faster inference. arXiv preprint arXiv:2104.01136,

[7] [7]

Semantic contours from inverse detectors

Bharath Hariharan, Pablo Arbeláez, Lubomir Bourdev, Subhransu Maji, and Jitendra Malik. Semantic contours from inverse detectors. In 2011 International Conference on Computer Vision, pp. 991–998. IEEE,

[8] [8]

Deep residual learning for image recognition

Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In Proceedings of the IEEE conference on computer vision and pattern recognition, pp. 770–778,

[9] [9]

MobileNets: Efficient Convolutional Neural Networks for Mobile Vision Applications

Andrew G Howard, Menglong Zhu, Bo Chen, Dmitry Kalenichenko, Weijun Wang, Tobias Weyand, Marco Andreetto, and Hartwig Adam. Mobilenets: Efficient convolutional neural networks for mobile vision applications. arXiv preprint arXiv:1704.04861,

[10] [10]

Flattened Convolutional Neural Networks for Feedforward Acceleration

Jonghoon Jin, Aysegul Dundar, and Eugenio Culurciello. Flattened convolutional neural networks for feedforward acceleration. arXiv preprint arXiv:1412.5474,

[11] [11]

Localvit: Bringing locality to vision transformers

Yawei Li, Kai Zhang, Jiezhang Cao, Radu Timofte, and Luc Van Gool. Localvit: Bringing locality to vision transformers. arXiv preprint arXiv:2104.05707,

[12] [12]

Swin Transformer: Hierarchical Vision Transformer using Shifted Windows

Ze Liu, Yutong Lin, Yue Cao, Han Hu, Yixuan Wei, Zheng Zhang, Stephen Lin, and Baining Guo. Swin transformer: Hierarchical vision transformer using shifted windows. arXiv preprint arXiv:2103.14030,

[13] [13]

Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation

Sachin Mehta, Mohammad Rastegari, Anat Caspi, Linda Shapiro, and Hannaneh Hajishirzi. Espnet: Efficient spatial pyramid of dilated convolutions for semantic segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pp. 552–568,

[14] [14]

Vision transformers for dense prediction

René Ranftl, Alexey Bochkovskiy, and Vladlen Koltun. Vision transformers for dense prediction. arXiv preprint arXiv:2103.13413,

[15] [15]

Dynamicvit: Efficient vision transformers with dynamic token sparsification

Yongming Rao, Wenliang Zhao, Benlin Liu, Jiwen Lu, Jie Zhou, and Cho-Jui Hsieh. Dynamicvit: Efficient vision transformers with dynamic token sparsification. arXiv preprint arXiv:2106.02034,

[16] [16]

Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism

Mohammad Shoeybi, Mostofa Patwary, Raul Puri, Patrick LeGresley, Jared Casper, and Bryan Catanzaro. Megatron-lm: Training multi-billion parameter language models using model parallelism. arXiv preprint arXiv:1909.08053,

[17] [17]

Wang, H., Ma, S., Dong, L., Huang, S., Zhang, D., and Wei, F

Hugo Touvron, Matthieu Cord, Matthijs Douze, Francisco Massa, Alexandre Sablayrolles, and Hervé Jégou. Training data-efficient image transformers & distillation through attention. In International Conference on Machine Learning, pp. 10347–10357. PMLR, 2021a. Hugo Touvron, Matthieu Cord, Alexandre Sablayrol...

[18] [18]

Linformer: Self-Attention with Linear Complexity

Sinong Wang, Belinda Z Li, Madian Khabsa, Han Fang, and Hao Ma. Linformer: Self-attention with linear complexity. arXiv preprint arXiv:2006.04768,

[19] [19]

Cvt: Introducing convolutions to vision transformers

Haiping Wu, Bin Xiao, Noel Codella, Mengchen Liu, Xiyang Dai, Lu Yuan, and Lei Zhang. Cvt: Introducing convolutions to vision transformers. arXiv preprint arXiv:2103.15808,

[20] [20]

Early convolutions help transformers see better

Tete Xiao, Mannat Singh, Eric Mintun, Trevor Darrell, Piotr Dollár, and Ross Girshick. Early convolutions help transformers see better. arXiv preprint arXiv:2106.14881,

[21] [21]

Incorporating convolution designs into visual transformers

Kun Yuan, Shaopeng Guo, Ziwei Liu, Aojun Zhou, Fengwei Yu, and Wei Wu. Incorporating convolution designs into visual transformers. arXiv preprint arXiv:2103.11816, 2021a. Li Yuan, Yunpeng Chen, Tao Wang, Weihao Yu, Yujun Shi, Zihang Jiang, Francis EH Tay, Jiashi Feng, and Shuicheng Yan. Tokens-to-token vit: Training vision transformers from scratch on i...

[22] [22]

Multi-scale vision longformer: A new vision transformer for high-resolution image encoding

Pengchuan Zhang, Xiyang Dai, Jianwei Yang, Bin Xiao, Lu Yuan, Lei Zhang, and Jianfeng Gao. Multi-scale vision longformer: A new vision transformer for high-resolution image encoding. arXiv preprint arXiv:2103.15358,

[23] [23]

Deepvit: Towards deeper vision transformer

Daquan Zhou, Bingyi Kang, Xiaojie Jin, Linjie Yang, Xiaochen Lian, Zihang Jiang, Qibin Hou, and Jiashi Feng. Deepvit: Towards deeper vision transformer. arXiv preprint arXiv:2103.11886,

[24] [24]

A MOBILEVIT ARCHITECTURE MobileViTs are inspired by the philosophy of light-weight CNNs, and the overall architecture of MobileViT at different parameter budgets is given in Table

[25] [25]

We set the output dimension of the first feed-forward layer in a transformer layer as 2d instead of 4d, a default value in the standard transformer block of Vaswani et al

The transformer layer in MobileViT takes a d-dimensional input, as shown in Figure 1b. We set the output dimension of the first feed-forward layer in a transformer layer as 2d instead of 4d, a default value in the standard transformer block of Vaswani et al. (2017). B MULTI-SCALE SAMPLER Multi-scale sampler reduces generalization gap. Generalization capabi...
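
The effect of choosing 2d instead of 4d for the first feed-forward layer can be checked with a short parameter count. This is a rough sketch under stated assumptions: the helper `ffn_params` is hypothetical, and it assumes a two-layer feed-forward block (d to expansion × d, then back to d) with a bias term in each layer.

```python
def ffn_params(d, expansion):
    """Count parameters of a two-layer feed-forward block.

    Assumes layer 1 maps d -> expansion * d and layer 2 maps back to d,
    each with a bias vector (an assumption made for this sketch).
    """
    hidden = expansion * d
    first = d * hidden + hidden    # weights + biases of layer 1
    second = hidden * d + d        # weights + biases of layer 2
    return first + second


def ffn_ratio(d):
    """Parameter ratio of the 2d variant relative to the 4d default."""
    return ffn_params(d, 2) / ffn_params(d, 4)
```

Because the weight matrices dominate the count, `ffn_ratio(d)` approaches one half as d grows, so the feed-forward part of each transformer layer needs roughly half the parameters of the standard 4d setting.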

[26] [26]

Here, d represents dimensionality of the input to the transformer layer in MobileViT block (Figure 1b)

[Table 4, final rows (flattened in extraction): Conv-1×1, 1, 320, 384, 640; Global pool, 1×1, 256, 1; Linear, 1000, 1000, 1000; Network Parameters, 1.3 M, 2.3 M, 5.6 M] Table 4: MobileViT architecture. Here, d represents dimensionality of the input to the transformer layer in MobileViT block (Figure 1b). By default, in MobileViT block, we set kernel size n as three and spatial dimensions of patch (height h and width w) i...

[27] [27]

Impact of patch sizes

and ViT-based (Figure 7b) models, even with only basic data augmentation. Impact of patch sizes. MobileViT combines convolutions and transformers to learn local and global representations effectively. Because convolutions are applied on n × n regions and self-attention [plot residue; x-axis: Epochs] ...
