In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.CV 3
verdicts
UNVERDICTED 3
representative citing papers
Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
STEP uses dynamic superpatch merging via dCTS and early token exits to cut token count by 2.5x and computational complexity by up to 4x on ViT-Large for high-res segmentation, with an accuracy drop of at most 2% and 40% of tokens halted early.
citing papers explorer
- AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution
  AdaVFM integrates neural architecture search into vision foundation model backbones and uses a cloud multimodal LLM agent to enable runtime-adaptive lightweight subnet execution, delivering up to 7.9% higher accuracy and 77.9% lower FLOPs than fixed-size baselines on edge devices.
- Accelerating Vision Foundation Models with Drop-in Depthwise Convolution
  Replacing selected attention heads in pretrained ViTs with depthwise convolutions, identified by simple strategies and recovered via fine-tuning, delivers 17-20% inference speedup on image tasks with minimal accuracy loss.
- Where Do Tokens Go? Understanding Pruning Behaviors in STEP at High Resolutions
  STEP uses dynamic superpatch merging via dCTS and early token exits to cut token count by 2.5x and computational complexity by up to 4x on ViT-Large for high-res segmentation, with an accuracy drop of at most 2% and 40% of tokens halted early.