
arxiv: 2604.15622 · v2 · submitted 2026-04-17 · 💻 cs.CV · cs.LG

AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

Pith reviewed 2026-05-10 08:34 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords adaptive inference · vision foundation models · edge computing · LLM guidance · neural architecture search · on-device AI · zero-shot classification · open-vocabulary segmentation

The pith

AdaVFM dynamically scales vision foundation models at runtime via LLM guidance to improve accuracy-efficiency trade-offs on edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an adaptive framework for running language-aligned vision foundation models on edge hardware with tight latency and power limits. It creates a family of lightweight model variants through neural architecture search and uses a cloud-based multimodal LLM to select the appropriate variant based on scene context and task difficulty. This matters because a fixed large model exceeds edge constraints, while a fixed small model loses accuracy on complex inputs. Since the accuracy cost of shrinking a model depends on the task and scene, choosing the variant at runtime can improve the overall trade-off. The paper reports gains over static baselines on zero-shot classification and open-vocabulary segmentation.

Core claim

The central claim is that the performance impact of model size reduction varies by task and scene in vision applications, so a runtime-adaptive execution strategy can maintain high accuracy while cutting average computation. AdaVFM embeds neural architecture search into the vision foundation model backbone to produce executable subnets of different sizes. A multimodal LLM agent deployed on the cloud provides context-aware control to select the right subnet during inference, enabling efficient adaptation across conditions.

What carries the argument

The runtime selection of NAS-derived subnets in the language-aligned VFM backbone, guided by a multimodal LLM agent for context-aware computation scaling.
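
The selection step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the `Subnet` record, the `difficulty` score in [0, 1] (standing in for whatever guidance the cloud agent returns), and the capacity thresholds are all hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    name: str
    gflops: float    # hypothetical cost of one forward pass
    capacity: float  # hypothetical: highest difficulty this subnet handles well


# Hypothetical family of NAS-derived subnets (values are made up).
SUBNETS = [
    Subnet("small", gflops=1.0, capacity=0.3),
    Subnet("medium", gflops=3.0, capacity=0.7),
    Subnet("large", gflops=8.0, capacity=1.0),
]


def select_subnet(difficulty: float, subnets=SUBNETS) -> Subnet:
    """Return the cheapest subnet whose capacity covers the difficulty score."""
    if not 0.0 <= difficulty <= 1.0:
        raise ValueError("difficulty must lie in [0, 1]")
    for subnet in sorted(subnets, key=lambda s: s.gflops):
        if subnet.capacity >= difficulty:
            return subnet
    # No subnet claims to cover this difficulty: fall back to the largest.
    return max(subnets, key=lambda s: s.gflops)


for score in (0.1, 0.5, 0.9):
    print(score, select_subnet(score).name)
```

The point of the sketch is only the control flow: an external signal decides, per input or per scene, which member of a fixed subnet family runs on the device.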

If this is right

  • Surpasses prior adaptive and static methods by up to 7.9% top-1 accuracy on ImageNet-1K for models of comparable size.
  • Delivers up to 5.2% higher mean IoU on ADE20K for segmentation models of similar scale.
  • Reduces average FLOPs by up to 77.9% while preserving comparable accuracy levels.
  • Enables practical zero-shot classification and open-vocabulary segmentation under edge latency and power limits.
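
To make "average FLOPs" in the third bullet concrete: under adaptive execution, the average cost is the selection-frequency-weighted mean of the per-subnet costs. The sketch below shows only that arithmetic. The costs and frequencies are invented for illustration and are not numbers from the paper.

```python
def average_flops(costs, frequencies):
    """Frequency-weighted mean cost of an adaptive execution policy."""
    if len(costs) != len(frequencies):
        raise ValueError("costs and frequencies must have the same length")
    total = sum(frequencies)
    if total <= 0:
        raise ValueError("frequencies must sum to a positive value")
    return sum(c * f for c, f in zip(costs, frequencies)) / total


def reduction_percent(baseline, adaptive):
    """Relative saving of the adaptive policy against a fixed baseline cost."""
    return 100.0 * (1.0 - adaptive / baseline)


# Hypothetical subnet costs (GFLOPs) and how often each one is selected.
costs = [1.0, 3.0, 8.0]
frequencies = [0.6, 0.3, 0.1]

avg = average_flops(costs, frequencies)  # roughly 2.3 GFLOPs
saving = reduction_percent(8.0, avg)     # roughly 71% versus always running the largest
print(round(avg, 3), round(saving, 2))
```

A large reported reduction therefore implies that the small subnets are selected most of the time on the evaluated data.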

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The cloud-edge split with LLM control could extend to other foundation models where input difficulty varies across samples.
  • Runtime adaptation may lower average energy use in continuous mobile operation beyond what static compression achieves.
  • End-to-end training of the selection agent with the vision subnets might further tighten the accuracy-efficiency curve.

Load-bearing premise

The accuracy loss from using smaller model variants varies enough by scene and task that dynamic selection yields a better overall trade-off than any fixed size.

What would settle it

A controlled test on inputs where accuracy degradation from model compression is identical regardless of scene complexity or task difficulty, showing no benefit from adaptation over the best static model.
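
A toy version of this test shows why uniform degradation removes the benefit. Everything below is hypothetical: two scene types, two subnets, invented accuracies, and a simple policy that picks the cheapest subnet within a tolerance of the largest subnet's accuracy on each scene type.

```python
def adaptive_vs_static(acc, costs, mix, tol=0.02):
    """Compare a per-scene selection policy with the best static subnet.

    acc[scene][k] : toy accuracy of subnet k on that scene type
    costs[k]      : cost of subnet k, sorted ascending
    mix[scene]    : fraction of inputs of that scene type (sums to 1)
    Returns (adaptive_acc, adaptive_cost, best_static_acc_within_that_cost).
    """
    choice = {}
    for scene, scores in acc.items():
        target = scores[-1] - tol  # stay within tol of the largest subnet
        choice[scene] = next(k for k, a in enumerate(scores) if a >= target)
    adaptive_acc = sum(mix[s] * acc[s][choice[s]] for s in acc)
    adaptive_cost = sum(mix[s] * costs[choice[s]] for s in acc)
    static_acc = max(
        sum(mix[s] * acc[s][k] for s in acc)
        for k, c in enumerate(costs)
        if c <= adaptive_cost + 1e-9
    )
    return adaptive_acc, adaptive_cost, static_acc


costs = [1.0, 8.0]
mix = {"simple": 0.5, "complex": 0.5}

# Degradation depends on the scene: the small subnet is nearly free on simple scenes.
varying = {"simple": [0.80, 0.81], "complex": [0.50, 0.75]}
# Degradation is the same 0.10 drop on every scene type.
uniform = {"simple": [0.70, 0.80], "complex": [0.50, 0.60]}

print(adaptive_vs_static(varying, costs, mix))
print(adaptive_vs_static(uniform, costs, mix))
```

In the scene-dependent case the policy beats the best static subnet at the same average cost. In the uniform case it collapses to always choosing the same subnet, so it matches a static model exactly, which is the outcome the settling experiment would look for.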

Figures

Figures reproduced from arXiv: 2604.15622 by Barbara De Salvo, Chiao Liu, Cijo Jose, Huapeng Su, Jieyu Lin, Michael Ramamonjisoa, Patrick Labatut, Phillip B. Gibbons, Stefano Ambrogio, Yiwei Zhao, Yi Zheng, Ziyun Li.

Figure 1
Left (a): Always-on smart glasses with on-device VFM. Right (b): End-to-end mIoU on open-vocabulary ADE20K segmentation [77]. Our design significantly improves mIoU by up to 5.2%, and reduces FLOPs by up to 77.9% over prior models. Deploying vision foundation models (VFMs) on edge devices, however, remains challenging, and enabling language-aligned VFMs is even more difficult. These models are large, comp…
Figure 2
Open-vocabulary segmentation on ADE [77] using 50M and 300M-parameter VFMs (300M: DINO.txt [29]; 50M: similarly distilled and fine-tuned). Results are shown for both the original 150-class and the grouped 9-class setting (grouping based on WordNet [18], described in § 6.2). Left: Both models perform similarly on the simpler 9-class task, but in the complex 150-class setting, VFM-300M yields clear object b…
Figure 3
Overview of the proposed AdaVFM. Left: Edge-side execution, where the adaptive Vision Encoder follows cloud agent instructions to select an efficient execution scheme and perform vision-text contrastive inference. Right: Cloud-side execution, where the agent uses scene/context information to generate semantic understanding for the Text Encoder and execution guidance for the Vision Encoder. scene and contex…
Figure 4
Left (a): Operation flow: LLM-based runtime management agent. Right (b): Overview of the training pipeline. The vision backbone is first distilled from a foundation model (DINOv2 [49]), followed by vision-text alignment using CLIP. Both stages employ NAS and sandwich sampling [70]. 3.3 LLM-Guided Efficient Runtime Execution A central component of our system is the LLM-guided runtime management agent, invo…
Figure 6
Left (a): Impact of the LLM runtime agent on open-vocabulary segmentation on ADE20K [77]. NAS uses text-aligned subnets without the LLM agent; AdaVFM w/o Subnet Selection uses LLM only for semantic class filtering; AdaVFM uses the LLM for both semantic class filtering and adaptive subnet selection. Efficiency gains show that runtime adaptive selection is critical. Right (b): Subnet selection across α. Larg…
Figure 1
Our ARM Ethos-U55 test silicon. B Architecture of Basic Blocks B.1 ConvNeXt-v2 Blocks We use ConvNeXt-v2 [62] with selective capacity as the core building block of our model, as shown in Fig. 2a. All block widths (dims) are selectable during training and runtime, enabling the adaptive behavior of our model. We also replace GELU with ReLU for better compatibility and efficiency on edge devices. B.2 Downsamp…
Figure 2
(a) Left: Selective ConvNeXt-v2 blocks. (b) Right: Downsample layers. The block widths (dims) are selectable during training and runtime. Downsample Layer 1 uses two consecutive 3 × 3-Conv2D layers with stride 2, yielding an effective downsampling factor of 4. Downsample Layer 2 uses a single 3 × 3-Conv2D layer with stride 2. Downsample Layers 3 and 4 instead apply 1 × 1-Conv2D layers with stride 2. C Grou…
Figure 3
Left (a): End-to-end trade-off between mIoU on open-vocabulary ADE20K segmentation [77] and average execution latency. Right (b): End-to-end trade-off between mIoU and average energy consumption. E End-to-End Accuracy-Efficiency Trade-offs with Additional Metrics In the main paper, we present the accuracy-efficiency trade-off (Fig. 1b in the main paper) using FLOPs. Here, we additionally report results ba…
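
The stride arithmetic in the appendix caption on downsample layers can be checked directly. The sketch below multiplies the per-layer strides listed there. The 224-pixel input size is an assumption for illustration, as is the premise that no other layer changes the spatial resolution.

```python
from math import prod

# Strides per downsample layer, as listed in the appendix caption.
DOWNSAMPLE_STRIDES = {
    "layer1": [2, 2],  # two consecutive 3x3 convs, stride 2 each
    "layer2": [2],     # one 3x3 conv, stride 2
    "layer3": [2],     # 1x1 conv, stride 2
    "layer4": [2],     # 1x1 conv, stride 2
}


def cumulative_factors(strides=DOWNSAMPLE_STRIDES):
    """Cumulative downsampling factor after each layer, in order."""
    factors, total = {}, 1
    for name, layer_strides in strides.items():
        total *= prod(layer_strides)
        factors[name] = total
    return factors


def feature_size(input_size, factor):
    """Spatial size after downsampling, assuming each stride-2 conv rounds up."""
    return -(-input_size // factor)  # ceiling division


factors = cumulative_factors()
for name, factor in factors.items():
    print(name, factor, feature_size(224, factor))  # 224 is an assumed input size
```

Under these assumptions the four layers compound to an overall factor of 32, the usual total stride for a four-stage hierarchical backbone.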
Original abstract

Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to 7.9% in acc@1 on IN1K and 5.2% mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to 77.9%.
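
For readers unfamiliar with the zero-shot setting in the abstract, here is a generic sketch of vision-text contrastive inference: an image embedding is scored against text embeddings of candidate class names, optionally restricted to a shortlist (standing in for the agent's semantic class filtering mentioned in the figure captions). The embeddings are toy vectors and the function names are hypothetical. Nothing here reflects AdaVFM's actual encoders or interfaces.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def zero_shot_classify(image_emb, text_embs, allowed=None):
    """Return the class whose text embedding is most similar to the image.

    allowed: optional set of class names to consider (toy class filter).
    """
    candidates = text_embs if allowed is None else {
        name: emb for name, emb in text_embs.items() if name in allowed
    }
    if not candidates:
        raise ValueError("no candidate classes left after filtering")
    return max(candidates, key=lambda name: cosine(image_emb, candidates[name]))


# Toy text embeddings; a real system would get these from a text encoder.
TEXT_EMBS = {
    "cat": [1.0, 0.0, 0.0],
    "dog": [0.0, 1.0, 0.0],
    "car": [0.0, 0.0, 1.0],
}
image = [0.6, 0.7, 0.1]  # toy image embedding

print(zero_shot_classify(image, TEXT_EMBS))
print(zero_shot_classify(image, TEXT_EMBS, allowed={"cat", "car"}))
```

Shrinking the candidate set changes the answer without touching the vision model, which is why class filtering and subnet selection can be treated as separate levers.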

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdaVFM, an adaptive framework for on-device inference of language-aligned vision foundation models (VFMs). It integrates neural architecture search (NAS) into the VFM backbone to enable runtime execution of lightweight subnets and uses a cloud-deployed multimodal LLM as a context-aware agent to dynamically select the execution path based on scene context and task complexity. The central claim is that this approach exploits the task-dependent impact of model size reduction to achieve superior accuracy-efficiency trade-offs, with reported gains of up to 7.9% top-1 accuracy on ImageNet-1K zero-shot classification and 5.2% mIoU on ADE20K open-vocabulary segmentation, plus up to 77.9% average FLOPs reduction for comparable accuracy.

Significance. If the experimental results are reproducible and the adaptive mechanism is shown to be the primary driver, this could meaningfully advance edge deployment of large VFMs by offering a practical runtime adaptation strategy without retraining. The cross-modal use of an LLM agent for control is a notable design choice that could generalize to other adaptive inference settings.

major comments (2)
  1. [Abstract and §4] Abstract and §4 (Experiments): The stated improvements (7.9% acc@1 on IN1K, 5.2% mIoU on ADE20K, 77.9% FLOPs reduction) are presented without any description of baselines, model sizes compared, number of runs, variance, or statistical tests. This directly undermines evaluation of the central accuracy-efficiency claim.
  2. [§3] §3 (Method): The motivating assumption that 'the effect of model size reduction on performance is task-dependent' is used to justify the entire adaptive NAS+LLM design, yet no controlled ablation or analysis is referenced showing how performance degradation varies across tasks/scenes to support the runtime selection policy.
minor comments (2)
  1. [§3.3] Clarify the exact interface between the on-device NAS subnets and the cloud LLM agent, including latency overhead of the control loop and any assumptions about network connectivity.
  2. Ensure consistent terminology for 'subnet' vs. 'model size' throughout; the NAS integration description would benefit from a diagram of the searchable space.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying our experimental setup where possible and committing to revisions that strengthen the presentation of results and the justification for our design choices.

point-by-point responses
  1. Referee: [Abstract and §4] Abstract and §4 (Experiments): The stated improvements (7.9% acc@1 on IN1K, 5.2% mIoU on ADE20K, 77.9% FLOPs reduction) are presented without any description of baselines, model sizes compared, number of runs, variance, or statistical tests. This directly undermines evaluation of the central accuracy-efficiency claim.

    Authors: We agree that the abstract and §4 would benefit from greater explicitness to allow full evaluation of the claims. The manuscript already compares against fixed-size VFM baselines (e.g., CLIP-ViT variants, BLIP, and prior NAS methods) of comparable parameter counts and FLOPs, with the 7.9% and 5.2% figures representing the maximum observed gains over the strongest such baseline at each operating point. In the revised version we will (i) expand the abstract and §4 to list the exact baseline models and their sizes, (ii) report results averaged over 3–5 runs with standard deviations, and (iii) add paired statistical significance tests for the key accuracy and FLOPs differences. These additions will be placed in a new “Evaluation Protocol” subsection of §4. revision: yes

  2. Referee: [§3] §3 (Method): The motivating assumption that 'the effect of model size reduction on performance is task-dependent' is used to justify the entire adaptive NAS+LLM design, yet no controlled ablation or analysis is referenced showing how performance degradation varies across tasks/scenes to support the runtime selection policy.

    Authors: The core insight is indeed that performance sensitivity to model size varies with scene complexity and task type; this is what enables the LLM agent to select subnets profitably at runtime. While §4 already shows that AdaVFM outperforms fixed-size models on two distinct tasks (zero-shot classification and open-vocabulary segmentation), we acknowledge that a more targeted, controlled demonstration of the variation itself would strengthen the motivation. In the revised manuscript we will add a dedicated ablation (new Figure or subsection in §3 or §4) that measures accuracy degradation for the same set of subnets across controlled subsets of ImageNet and ADE20K stratified by scene complexity (e.g., object density, lighting variation). This will directly illustrate the task/scene dependence that justifies the adaptive policy. revision: yes
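
The evaluation protocol promised in the first response (several runs, standard deviations, paired tests) could look like the sketch below. The per-run accuracies are invented placeholders, not results from the paper, and the paired t statistic is one reasonable choice of test among several.

```python
from math import sqrt
from statistics import mean, stdev


def summarize(runs):
    """Mean and sample standard deviation over repeated runs."""
    return mean(runs), stdev(runs)


def paired_t_statistic(a, b):
    """t statistic for paired samples a[i], b[i] (e.g. same seed or split)."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equally sized samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)
    if sd == 0:
        raise ValueError("differences are constant; t is undefined")
    return mean(diffs) / (sd / sqrt(len(diffs)))


# Placeholder accuracies (%) for five paired runs; not data from the paper.
adaptive = [70.1, 70.4, 69.8, 70.3, 70.0]
baseline = [68.0, 68.5, 67.9, 68.1, 68.2]

print(summarize(adaptive))
print(summarize(baseline))
print(paired_t_statistic(adaptive, baseline))  # compare against t with n-1 dof
```

Pairing by seed or data split matters here because run-to-run variation is shared between the two methods, and the paired statistic removes it.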

Circularity Check

0 steps flagged

No circularity in derivation chain

full rationale

The paper describes an engineering framework for adaptive VFM inference using NAS and cloud LLM control, motivated by the empirical observation that model size reduction effects are task-dependent. No equations, first-principles derivations, or predictions are presented that reduce to inputs by construction. All performance claims rest on external benchmark comparisons (IN1K, ADE20K) against prior baselines, with no self-citation load-bearing steps or fitted parameters renamed as predictions. The derivation chain is self-contained against external experimental evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; the approach appears to rest on standard NAS and LLM components whose concrete realizations are not specified.

pith-pipeline@v0.9.0 · 5568 in / 1207 out tokens · 61214 ms · 2026-05-10T08:34:04.739899+00:00 · methodology


Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 1 internal anchor

  1. [1]

    Meta Ray-Ban smart glasses. https://www.meta.com/ai-glasses/ray-ban-meta/ (2023), a series of AI-enabled smart glasses combining camera, audio, and voice-controlled Meta AI features, developed by Meta Platforms in partnership with Ray-Ban

  2. [2]

    Abbaspourazad, S., Elachqar, O., Miller, A., Emrani, S., Nallasamy, U., Shapiro, I.: Large-scale training of foundation models for wearable biosignals. In: The Twelfth International Conference on Learning Representations (2024)

  3. [3]

    Abrash, M.: Creating the future: Augmented reality, the next human-machine interface. In: 2021 IEEE International Electron Devices Meeting (IEDM). pp. 1–11 (2021). https://doi.org/10.1109/IEDM19574.2021.9720526

  4. [4]

    Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo:... In: Proceedings of the 36th International Conference on Neural Information Processing Systems

  5. [5]

    Arm®: Arm ethos-u55 micronpu description. https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55 (Accessed 2026-03)

  6. [6]

    Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: European Conference on Computer Vision (2014)

  7. [7]

    Brun, D.: Multimodal and context-aware interaction in augmented reality for active assistance. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. pp. 506–510 (2018)

  8. [8]

    Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once-for-all: Train one network and specialize it for efficient deployment. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=HylxE1HKwS

  9. [9]

    Chen, J., Hu, J., Wang, G., Jiang, Z., Zhou, T., Chen, Z., Lv, C.: Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10723–10734 (June 2025)

  10. [10]

    Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: The Eleventh International Conference on Learning Representations (2023)

  11. [11]

    Chu, X., Zhang, B., Xu, R.: Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12239–12248 (2021)

  12. [12]

    Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2014)

  13. [13]

    Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)

  14. [14]

    In: Proceedings of the 40th International Conference on Machine Learning

    Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., Florence, P.: Palm-e: an embodied multimodal language model. In: Proceedings of the 40th Inter...

  15. [15]

    Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV. pp. 75–92. Springer Nature Switzerland, Cham (2025)

  17. [17]

    Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A.T., Shankar, V.: Data filtering networks. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=KAk6ngZ09F

  18. [18]

    Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer Vision and Image Understanding 106(1), 59–70 (2007)

  19. [19]

    Fellbaum, C.: WordNet: An electronic lexical database. MIT Press (1998)

  20. [20]

    Fundamental AI Research, M.: Introducing llama 4: Advancing multimodal intelligence. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ (2024), accessed April 5, 2025

  21. [21]

    Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Fei-Fei, L.: Fine-grained car detection for visual census estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 31 (2017)

  22. [22]

    Georg, M., Tanzer, G., Uboweja, E., Hassan, S., Shengelia, M., Sepah, S., Forbes, S., Starner, T.: Fsboard: Over 3 million characters of asl fingerspelling collected via smartphones. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13897–13906 (June 2025)

  23. [23]

    Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling open-vocabulary image segmentation with image-level labels. In: European Conference on Computer Vision. pp. 540–557. Springer (2022)

  24. [24]

    Goswami, R.G., Krishnamurthy, P., LeCun, Y., Khorrami, F.: Robopepp: Vision-based robot pose and joint angle estimation through embedding predictive pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6930–6939 (June 2025)

  25. [25]

    He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)

  26. [26]

    Hoang, M.L.: A comprehensive review of machine learning, and deep learning in wearable iot devices. IEEE Access (2025)

  27. [27]

    Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L.C., Tan, M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R., Adam, H., Le, Q.: Searching for mobilenetv3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1314–1324 (2019). https://doi.org/10.1109/ICCV.2019.00140

  28. [28]

    Ilharco, G., Wortsman, M., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., et al.: Openclip. Zenodo (2021)

  29. [29]

    Imam, M.F., Marew, R.F., Hassan, J., Fiaz, M., Aji, A.F., Cholakkal, H.: Clip meets dino for tuning zero-shot classifier using unlabeled image collections. arXiv preprint arXiv:2411.19346 (2024)

  30. [30]

    Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al.: Dinov2 meets text: A unified framework for image- and pixel-level vision-language alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24905–24916 (2025)

  31. [31]

    Kavukcuoglu, K., Pichai, S., Hassabis, D., Walker, K., Manyika, J., Porat, R.: Gemini-2.5: Our most intelligent ai model. Google DeepMind Blog (Mar 2025)

  32. [32]

    Khattak, M.U., Naeem, M.F., Naseer, M., Van Gool, L., Tombari, F.: Learning to prompt with text only supervision for vision-language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4230–4238 (2025)

  33. [33]

    Kim, S.Y., Chung, D.o., Lee, K., Lee, C., Huh, J.: Low-power always-on camera (aoc) system with workload offloading to cmos image sensor. In: 2023 IEEE International Conference on Consumer Electronics (ICCE). pp. 1–2. IEEE (2023)

  34. [34]

    Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)

  35. [35]

    Lane, N.D., Bhattacharya, S., Georgiev, P., Forlivesi, C., Jiao, L., Qendro, L., Kawsar, F.: Deepx: A software accelerator for low-power deep learning inference on mobile devices. In: 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). pp. 1–12 (2016). https://doi.org/10.1109/IPSN.2016.7460664

  36. [36]

    Lee, J., Joo, H.: Mocap everyone everywhere: Lightweight motion capture with smartwatches and a head-mounted camera. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 1091–1100 (2024)

  37. [37]

    Li, C., Tang, T., Wang, G., Peng, J., Wang, B., Liang, X., Chang, X.: Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12281–12291 (2021)

  38. [38]

    Li, E., Zhou, Z., Chen, X.: Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In: Proceedings of the 2018 Workshop on Mobile Edge Communications. p. 31–36. MECOMM’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3229556.3229562

  39. [39]

    Li, J.N., Zhang, Z.J., Ma, J.: Omniquery: Contextually augmenting captured multimodal memories to enable personal question answering. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. CHI ’25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3706598.3713448

  40. [40]

    Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)

  41. [41]

    Lin, R., Weng, P., Wang, Y., Ding, H., Han, J., Wang, F.: Hilots: High-low temporal sensitive representation learning for semi-supervised lidar segmentation in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1429–1438 (June 2025)

  42. [42]

    Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 11976–11986 (2022)

  43. [43]

    Lu, Z., Sreekumar, G., Goodman, E., Banzhaf, W., Deb, K., Boddeti, V.N.: Neural architecture transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(9), 2971–2989 (2021). https://doi.org/10.1109/TPAMI.2021.3052758

  44. [44]

    Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. Tech. rep. (2013)

  45. [45]

    Mehta, S., Rastegari, M.: Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: International Conference on Learning Representations (2022)

  46. [46]

    Mirza, M.J., Karlinsky, L., Lin, W., Possegger, H., Kozinski, M., Feris, R., Bischof, H.: Lafter: Label-free tuning of zero-shot classifier using language and unlabeled image collections. Advances in Neural Information Processing Systems 36, 5765–5777 (2023)

  47. [47]

    Moon, G., Weipeng, X., Joshi, R., Chenglei, W., Shiratori, T.: Authentic hand avatar from a phone scan via universal hand model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2029–2038 (2024). https://doi.org/10.1109/CVPR52733.2024.00198

  48. [48]

    Nazhamaiti, M., Xu, H., Liu, Z., Chen, Y., Wei, Q., Wu, X., Qiao, F.: Ns-md: near-sensor motion detection with energy harvesting image sensor for always-on visual perception. IEEE Transactions on Circuits and Systems II: Express Briefs 68(9), 3078–3082 (2021)

  49. [49]

    OpenAI: Introducing gpt-5. https://openai.com/index/introducing-gpt-5/ (August 2025), large language model

  50. [50]

    Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)

  51. [51]

    In: 2012 IEEE conference on computer vision and pattern recognition

    Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3498–3505. IEEE (2012)

  52. [52]

    In: Proc

    Raaen, K., Kjellmo, I.: Measuring latency in virtual reality systems. In: En- tertainment Computing - ICEC 2015. p. 457–462. Springer-Verlag, Berlin, Hei- delberg (2022).https://doi.org/10.1007/978- 3- 319- 24589- 8_40,https: //doi.org/10.1007/978-3-319-24589-8_40

  53. [53]

    In: International conference on machine learning

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

  54. [54]

    In: International conference on machine learning

    Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., Ku- rakin, A.: Large-scale evolution of image classifiers. In: International conference on machine learning. pp. 2902–2911. PMLR (2017)

  55. [55]

    Roth, K., Kim, J.M., Koepke, A., Vinyals, O., Schmid, C., Akata, Z.: Waffling around for performance: Visual classification with random words and broad concepts. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 15746–15757 (2023)

  56. [56]

    Seneviratne, S., Hu, Y., Nguyen, T., Lan, G., Khalifa, S., Thilakarathna, K., Hassan, M., Seneviratne, A.: A survey of wearable devices and challenges. IEEE Communications Surveys & Tutorials 19(4), 2573–2620 (2017)

  57. [57]

    Serianni, A., Kalita, J.: Training-free neural architecture search for RNNs and transformers. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2522–2540. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.18653...

  58. [58]

    Shrivastava, A., Selvaraju, R.R., Naik, N., Ordonez, V.: Clip-lite: Information efficient visual representation learning with language supervision. In: International Conference on Artificial Intelligence and Statistics. pp. 8433–8447. PMLR (2023)

  59. [59]

    Skillman, A., Edsö, T.: A technical overview of cortex-m55 and ethos-u55: Arm’s most capable processors for endpoint ai. In: 2020 IEEE Hot Chips 32 Symposium (HCS). pp. 1–20 (2020). https://doi.org/10.1109/HCS49909.2020.9220415

  60. [60]

    Stauffert, J.P., Niebling, F., Latoschik, M.E.: Reducing application-stage latencies for real-time interactive systems. In: 2016 IEEE 9th Workshop on Software Engineering and Architectures for Realtime Interactive Systems (SEARIS). pp. 1–7. IEEE (2016)

  61. [61]

    Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

  62. [62]

    Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

  63. [63]

    Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16133–16142 (2023)

  64. [64]

    Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., et al.: Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21970–21980 (2023)

  65. [65]

    Xia, W., Feng, R., Wang, D., Hu, D.: Phoenix: A motion-based self-reflection framework for fine-grained robotic action correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6981–6990 (June 2025)

  66. [66]

    Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition. pp. 3485–3492. IEEE (2010)

  67. [67]

    Xing, Y., Kang, J., Xiao, A., Nie, J., Shao, L., Lu, S.: Rewrite caption semantics: Bridging semantic gaps for language-supervised semantic segmentation. Advances in Neural Information Processing Systems 36, 68798–68809 (2023)

  68. [68]

    Xing, Z., Zhang, X., Hu, Y., Jiang, B., He, T., Zhang, Q., Long, X., Yin, W.: Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1602–1611 (June 2025)

  69. [69]

    Xu, H., Xie, S., Tan, X., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP data. In: The Twelfth International Conference on Learning Representations (2024)

  70. [70]

    Xu, J., Hou, J., Zhang, Y., Feng, R., Wang, Y., Qiao, Y., Xie, W.: Learning open-vocabulary semantic segmentation models from natural language supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2935–2944 (2023)

  71. [71]

    Yu, J., Huang, T.S.: Universally slimmable networks and improved training techniques. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1803–1811 (2019)

  72. [72]

    Yu, J., Jin, P., Liu, H., Bender, G., Kindermans, P.J., Tan, M., Huang, T., Song, X., Pang, R., Le, Q.: Bignas: Scaling up neural architecture search with big single-stage models. In: Computer Vision–ECCV 2020: 16th European Conference, Part VII 16. pp. 702–717. Springer (2020)

  73. [73]

    Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

  74. [74]

    Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)

  75. [75]

    Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems 35, 36067–36080 (2022)

  76. [76]

    Zhao, Y., Chen, J., Zhang, S.Q., Sarwar, S.S., Stangherlin, K.H., Gomez, J.T., Seo, J.S., De Salvo, B., Liu, C., Gibbons, P.B., Li, Z.: H4h: Hybrid convolution-transformer architecture search for npu-cim heterogeneous systems for ar/vr applications. In: Proceedings of the 30th Asia and South Pacific Design Automation Conference. p. 1133–1141. ASPDAC ’2...

  77. [77]

    Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

  78. [78]

    Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127(3), 302–321 (2019)

  79. [79]

    Zhou, J., Yu, X., Luo, P., et al.: ibot: Image bert pre-training with online tokenizer. In: International Conference on Learning Representations (ICLR) (2022)

  80. [80]

    Zoph, B., Le, Q.: Neural architecture search with reinforcement learning. In: International Conference on Learning Representations (2016)

    AdaVFM: Supplementary Material

    A Hardware Platform and Evaluation Setup

    We adopt the ARM Ethos-U55 [5,58] as a representative edge Neural Processing Unit (NPU). The test silicon (Fig. 1) is fabricated in 7nm FinFET an...
