MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer

Sachin Mehta, Mohammad Rastegari · 2021 · cs.CV · arXiv 2110.02178

23 Pith papers cite this work. Polarity classification is still indexing.

abstract

Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. Our source code is open-source and available at: https://github.com/apple/ml-cvnets

citation-role summary

background: 2 · method: 1

citation-polarity summary

background: 2 · use method: 1

representative citing papers

Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

cs.CR · 2026-05-19 · unverdicted · novelty 8.0

VIPER exposes Functional Fusion in dynamic prompt architectures, enabling a backdoor that resists pruning by tightly integrating attack and utility parameters in the same high-magnitude core.

RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis

eess.IV · 2025-07-07 · unverdicted · novelty 8.0

Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

cs.LG · 2026-05-15 · unverdicted · novelty 7.0

QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

cs.CV · 2026-05-12 · unverdicted · novelty 7.0

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

cs.CV · 2026-05-07 · unverdicted · novelty 7.0

RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.

KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

cs.CV · 2026-04-25 · unverdicted · novelty 7.0

KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

Cross-Stage Attention Propagation for Efficient Semantic Segmentation

cs.CV · 2026-04-07 · unverdicted · novelty 7.0

CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Tiny on ADE20K, Cityscapes, and COCO-Stuff.

EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems

cs.LG · 2025-05-02 · unverdicted · novelty 7.0

OD-TTA enables resource-efficient test-time adaptation on edge devices by triggering updates only on detected domain shifts, achieving comparable accuracy with lower energy and computation costs for embodied visual systems.

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

cs.CV · 2026-05-13 · unverdicted · novelty 6.0

SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

cs.CV · 2026-05-07 · unverdicted · novelty 6.0

OneViewAll achieves 92.5% ADD-0.1 accuracy on LINEMOD for novel object 6D pose estimation using only one real reference view by integrating category, symmetry, and patch-level semantic priors in a projection-equivariant alignment.

LLM as a Tool, Not an Agent: Code-Mined Tree Transformations for Neural Architecture Search

cs.LG · 2026-04-17 · unverdicted · novelty 6.0

LLMasTool improves neural architecture search by evolving code-mined hierarchical trees with diversity-guided Bayesian planning and targeted LLM assistance, reporting gains of 0.69, 1.83, and 2.68 points on CIFAR-10, CIFAR-100, and ImageNet16-120.

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

cs.CV · 2026-04-16 · unverdicted · novelty 6.0

Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.

TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models

eess.IV · 2025-10-22 · unverdicted · novelty 6.0

TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark with 6.36% of the parameters.

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

cs.LG · 2026-04-27 · unverdicted · novelty 5.0 · 2 refs

Comparative study applies UCB-V, UCB-Tuned, UCB-Bayes and UCB-BwK to ADNN early-exit selection on ResNet and MobileViT using CIFAR-10/100, reporting sub-linear regret with UCB-Bayes fastest and UCB-V/UCB-Tuned best on accuracy-energy and accuracy-latency Pareto fronts.

Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera

cs.CV · 2026-04-25 · unverdicted · novelty 5.0

A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.

CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model

cs.CV · 2026-04-25 · unverdicted · novelty 5.0

Hybrid CNN-ViT with adaptive attention gate achieves 97.6% accuracy on brain tumor MRI classification, outperforming baselines.

EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation

eess.IV · 2026-01-30 · conditional · novelty 5.0

EndoCaver introduces a unidirectional-guided dual-decoder transformer with GAM, DSA, and LoCoS modules for joint deblurring-segmentation, reporting 0.922 Dice on clean Kvasir-SEG data and 0.889 under degradation with 90% fewer parameters.

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

cs.CV · 2023-06-25 · conditional · novelty 5.0

MobileSAM is a 60x smaller distilled version of SAM that matches original performance and runs 5x faster than concurrent FastSAM while supporting CPU inference.

CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification

cs.CV · 2026-05-12 · unverdicted · novelty 4.0

CADS is a conformal-prediction-driven model cascade that routes images to scout or oracle models based on estimated complexity to reduce inference cost while preserving accuracy.

Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey

cs.CV · 2026-05-07 · unverdicted · novelty 4.0

A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.

Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation

cs.CV · 2026-03-16 · unverdicted · novelty 4.0

KL-regularised Group DRO improves F1 scores for multi-site COVID-19 CT classification and gender-fair four-class lung pathology recognition over prior challenge baselines.

Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification

cs.CV · 2025-07-31 · unverdicted · novelty 2.0

Hyperparameter tuning on seven lightweight models trained on a 90k-image ImageNet subset yields 1.5-3.5% top-1 accuracy gains, with RepVGG-A2 and MobileNetV3-L achieving sub-5ms latency and over 9800 FPS on GPU.

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

cs.CV · 2026-04-26

citing papers explorer

Showing 23 of 23 citing papers.

Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures

cs.CR · 2026-05-19 · unverdicted · none · ref 17 · internal anchor

VIPER exposes Functional Fusion in dynamic prompt architectures, enabling a backdoor that resists pruning by tightly integrating attack and utility parameters in the same high-magnitude core.

RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis

eess.IV · 2025-07-07 · unverdicted · none · ref 46 · internal anchor

Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.

Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis

cs.LG · 2026-05-15 · unverdicted · none · ref 18 · internal anchor

QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.

TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles

cs.CV · 2026-05-12 · unverdicted · none · ref 37 · internal anchor

TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.

RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis

cs.CV · 2026-05-07 · unverdicted · none · ref 41 · internal anchor

RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.

KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition

cs.CV · 2026-04-25 · unverdicted · none · ref 55 · internal anchor

KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.

Cross-Stage Attention Propagation for Efficient Semantic Segmentation

cs.CV · 2026-04-07 · unverdicted · none · ref 19 · internal anchor

CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Tiny on ADE20K, Cityscapes, and COCO-Stuff.

EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems

cs.LG · 2025-05-02 · unverdicted · none · ref 31 · internal anchor

OD-TTA enables resource-efficient test-time adaptation on edge devices by triggering updates only on detected domain shifts, achieving comparable accuracy with lower energy and computation costs for embodied visual systems.

SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection

cs.CV · 2026-05-13 · unverdicted · none · ref 36 · internal anchor

SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.

OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects

cs.CV · 2026-05-07 · unverdicted · none · ref 60 · internal anchor

OneViewAll achieves 92.5% ADD-0.1 accuracy on LINEMOD for novel object 6D pose estimation using only one real reference view by integrating category, symmetry, and patch-level semantic priors in a projection-equivariant alignment.

LLM as a Tool, Not an Agent: Code-Mined Tree Transformations for Neural Architecture Search

cs.LG · 2026-04-17 · unverdicted · none · ref 36 · internal anchor

LLMasTool improves neural architecture search by evolving code-mined hierarchical trees with diversity-guided Bayesian planning and targeted LLM assistance, reporting gains of 0.69, 1.83, and 2.68 points on CIFAR-10, CIFAR-100, and ImageNet16-120.

CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation

cs.CV · 2026-04-16 · unverdicted · none · ref 20 · internal anchor

Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.

TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models

eess.IV · 2025-10-22 · unverdicted · none · ref 6 · internal anchor

TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark with 6.36% of the parameters.

A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks

cs.LG · 2026-04-27 · unverdicted · none · ref 28 · 2 links · internal anchor

Comparative study applies UCB-V, UCB-Tuned, UCB-Bayes and UCB-BwK to ADNN early-exit selection on ResNet and MobileViT using CIFAR-10/100, reporting sub-linear regret with UCB-Bayes fastest and UCB-V/UCB-Tuned best on accuracy-energy and accuracy-latency Pareto fronts.

Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera

cs.CV · 2026-04-25 · unverdicted · none · ref 27 · internal anchor

A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.

CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model

cs.CV · 2026-04-25 · unverdicted · none · ref 4 · internal anchor

Hybrid CNN-ViT with adaptive attention gate achieves 97.6% accuracy on brain tumor MRI classification, outperforming baselines.

EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation

eess.IV · 2026-01-30 · conditional · none · ref 16 · internal anchor

EndoCaver introduces a unidirectional-guided dual-decoder transformer with GAM, DSA, and LoCoS modules for joint deblurring-segmentation, reporting 0.922 Dice on clean Kvasir-SEG data and 0.889 under degradation with 90% fewer parameters.

Faster Segment Anything: Towards Lightweight SAM for Mobile Applications

cs.CV · 2023-06-25 · conditional · none · ref 15 · internal anchor

MobileSAM is a 60x smaller distilled version of SAM that matches original performance and runs 5x faster than concurrent FastSAM while supporting CPU inference.

CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification

cs.CV · 2026-05-12 · unverdicted · none · ref 32 · internal anchor

CADS is a conformal-prediction-driven model cascade that routes images to scout or oracle models based on estimated complexity to reduce inference cost while preserving accuracy.

Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey

cs.CV · 2026-05-07 · unverdicted · none · ref 206 · internal anchor

A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.

Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation

cs.CV · 2026-03-16 · unverdicted · none · ref 21 · internal anchor

KL-regularised Group DRO improves F1 scores for multi-site COVID-19 CT classification and gender-fair four-class lung pathology recognition over prior challenge baselines.

Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification

cs.CV · 2025-07-31 · unverdicted · none · ref 4 · internal anchor

Hyperparameter tuning on seven lightweight models trained on a 90k-image ImageNet subset yields 1.5-3.5% top-1 accuracy gains, with RepVGG-A2 and MobileNetV3-L achieving sub-5ms latency and over 9800 FPS on GPU.

ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction

cs.CV · 2026-04-26 · unreviewed · ref 53 · internal anchor
