MobileViT: Light-weight, General-purpose, and Mobile-friendly Vision Transformer
23 Pith papers cite this work.
abstract
Light-weight convolutional neural networks (CNNs) are the de-facto for mobile vision tasks. Their spatial inductive biases allow them to learn representations with fewer parameters across different vision tasks. However, these networks are spatially local. To learn global representations, self-attention-based vision transformers (ViTs) have been adopted. Unlike CNNs, ViTs are heavy-weight. In this paper, we ask the following question: is it possible to combine the strengths of CNNs and ViTs to build a light-weight and low latency network for mobile vision tasks? Towards this end, we introduce MobileViT, a light-weight and general-purpose vision transformer for mobile devices. MobileViT presents a different perspective for the global processing of information with transformers, i.e., transformers as convolutions. Our results show that MobileViT significantly outperforms CNN- and ViT-based networks across different tasks and datasets. On the ImageNet-1k dataset, MobileViT achieves top-1 accuracy of 78.4% with about 6 million parameters, which is 3.2% and 6.2% more accurate than MobileNetv3 (CNN-based) and DeIT (ViT-based) for a similar number of parameters. On the MS-COCO object detection task, MobileViT is 5.7% more accurate than MobileNetv3 for a similar number of parameters. Our source code is open-source and available at: https://github.com/apple/ml-cvnets
citing papers explorer
- Exposing Functional Fusion: A New Class of Strategic Backdoor in Dynamic Prompt Architectures
  VIPER exposes Functional Fusion in dynamic prompt architectures, enabling a backdoor that resists pruning by tightly integrating attack and utility parameters in the same high-magnitude core.
- RAM-W600: A Multi-Task Wrist Dataset and Benchmark for Rheumatoid Arthritis
  Introduces RAM-W600, the first public multi-task dataset of wrist conventional radiographs with instance segmentation annotations and Sharp/van der Heijde bone erosion scores for rheumatoid arthritis research.
- Characterizing Learning in Deep Neural Networks using Tractable Algorithmic Complexity Analysis
  QuBD extends algorithmic complexity estimation to quantized DNN weights, revealing that complexity decreases during learning, increases with overfitting, follows grokking patterns, and correlates with generalization.
- TCP-SSM: Efficient Vision State Space Models with Token-Conditioned Poles
  TCP-SSM conditions stable poles on visual tokens to explicitly control memory decay and oscillation in SSMs, cutting computation up to 44% while matching or exceeding accuracy on classification, segmentation, and detection.
- RAM-H1200: A Unified Evaluation and Dataset on Hand Radiographs for Rheumatoid Arthritis
  RAM-H1200 introduces a public dataset of 1,200 hand X-rays with whole-hand bone segmentation, pixel-level bone erosion masks, and joint-level SvdH scores for both erosion and narrowing to enable unified RA analysis.
- KAConvNet: Kolmogorov-Arnold Convolutional Networks for Vision Recognition
  KAConvNet introduces a Kolmogorov-Arnold Convolutional Layer to build networks competitive with ViTs and CNNs while offering stronger theoretical interpretability.
- Cross-Stage Attention Propagation for Efficient Semantic Segmentation
  CSAP computes attention at the deepest scale and propagates the maps to shallower stages, bypassing per-scale query-key computations to cut decoder FLOPs while preserving multi-scale performance and beating SegNeXt-Tiny on ADE20K, Cityscapes, and COCO-Stuff.
- EmbodiTTA: Resource-Efficient Test-Time Adaptation for Embodied Visual Systems
  OD-TTA enables resource-efficient test-time adaptation on edge devices by triggering updates only on detected domain shifts, achieving comparable accuracy with lower energy and computation costs for embodied visual systems.
- SToRe3D: Sparse Token Relevance in ViTs for Efficient Multi-View 3D Object Detection
  SToRe3D delivers up to 3x faster inference for multi-view 3D object detection in ViTs by selecting relevant 2D tokens and 3D queries via mutual relevance heads with only marginal accuracy loss.
- OneViewAll: Semantic Prior Guided One-View 6D Pose Estimation for Novel Objects
  OneViewAll achieves 92.5% ADD-0.1 accuracy on LINEMOD for novel object 6D pose estimation using only one real reference view by integrating category, symmetry, and patch-level semantic priors in a projection-equivariant alignment.
- LLM as a Tool, Not an Agent: Code-Mined Tree Transformations for Neural Architecture Search
  LLMasTool improves neural architecture search by evolving code-mined hierarchical trees with diversity-guided Bayesian planning and targeted LLM assistance, reporting gains of 0.69, 1.83, and 2.68 points on CIFAR-10, CIFAR-100, and ImageNet16-120.
- CMTM: Cross-Modal Token Modulation for Unsupervised Video Object Segmentation
  Cross-modal token modulation enables better fusion of appearance and motion cues in two-stream models, leading to state-of-the-art results in unsupervised video object segmentation.
- TinyUSFM: Towards Compact and Efficient Ultrasound Foundation Models
  TinyUSFM distills a large ultrasound foundation model into a lightweight version using feature-gradient coreset selection and domain-separated masked image modeling, matching performance on a new 18-dataset benchmark with 6.36% of the parameters.
- A Comparative Analysis on the Performance of Upper Confidence Bound Algorithms in Adaptive Deep Neural Networks
  Comparative study applies UCB-V, UCB-Tuned, UCB-Bayes and UCB-BwK to ADNN early-exit selection on ResNet and MobileViT using CIFAR-10/100, reporting sub-linear regret with UCB-Bayes fastest and UCB-V/UCB-Tuned best on accuracy-energy and accuracy-latency Pareto fronts.
- Keypoint-based Dynamic Object 6-DoF Pose Tracking via Event Camera
  A keypoint-based pipeline extracts and tracks points from event streams to compute accurate 6-DoF poses of moving objects, outperforming prior event-based methods in simulated and real tests.
- CNN-ViT Fusion with Adaptive Attention Gate for Brain Tumor MRI Classification: A Hybrid Deep Learning Model
  Hybrid CNN-ViT with adaptive attention gate achieves 97.6% accuracy on brain tumor MRI classification, outperforming baselines.
- EndoCaver: Handling Fog, Blur and Glare in Endoscopic Images via Joint Deblurring-Segmentation
  EndoCaver introduces a unidirectional-guided dual-decoder transformer with GAM, DSA, and LoCoS modules for joint deblurring-segmentation, reporting 0.922 Dice on clean Kvasir-SEG data and 0.889 under degradation with 90% fewer parameters.
- Faster Segment Anything: Towards Lightweight SAM for Mobile Applications
  MobileSAM is a 60x smaller distilled version of SAM that matches original performance and runs 5x faster than concurrent FastSAM while supporting CPU inference.
- CADS: Conformal Adaptive Decision System for Cost-Efficient Image Classification
  CADS is a conformal-prediction-driven model cascade that routes images to scout or oracle models based on estimated complexity to reduce inference cost while preserving accuracy.
- Edge Deep Learning in Computer Vision and Medical Diagnostics: A Comprehensive Survey
  A comprehensive survey of edge deep learning in computer vision and medical diagnostics that presents a novel categorization of hardware platforms by performance and usage scenarios.
- Towards Fair and Robust Volumetric CT Classification via KL-Regularised Group Distributionally Robust Optimisation
  KL-regularised Group DRO improves F1 scores for multi-site COVID-19 CT classification and gender-fair four-class lung pathology recognition over prior challenge baselines.
- Analysis of Hyperparameter Optimization Effects on Lightweight Deep Models for Real-Time Image Classification
  Hyperparameter tuning on seven lightweight models trained on a 90k-image ImageNet subset yields 1.5-3.5% top-1 accuracy gains, with RepVGG-A2 and MobileNetV3-L achieving sub-5ms latency and over 9800 FPS on GPU.
- ESIA: An Energy-Based Spatiotemporal Interaction-Aware Framework for Pedestrian Intention Prediction