AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution

Barbara De Salvo; Chiao Liu; Cijo Jose; Huapeng Su; Jieyu Lin; Michael Ramamonjisoa; Patrick Labatut; Phillip B. Gibbons; Stefano Ambrogio; Yiwei Zhao

arxiv: 2604.15622 · v2 · submitted 2026-04-17 · 💻 cs.CV · cs.LG

Yiwei Zhao, Yi Zheng, Huapeng Su, Jieyu Lin, Stefano Ambrogio, Cijo Jose, Michael Ramamonjisoa, Patrick Labatut, Barbara De Salvo, Chiao Liu, Phillip B. Gibbons, Ziyun Li

Pith reviewed 2026-05-10 08:34 UTC · model grok-4.3

keywords: adaptive inference · vision foundation models · edge computing · LLM guidance · neural architecture search · on-device AI · zero-shot classification · open-vocabulary segmentation

The pith

AdaVFM dynamically scales vision foundation models at runtime via LLM guidance to improve accuracy-efficiency trade-offs on edge devices.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces an adaptive framework for running language-aligned vision foundation models on edge hardware with tight latency and power limits. It creates a family of lightweight model variants through neural architecture search and uses a cloud-based multimodal LLM to select the appropriate variant based on scene context and task difficulty. This approach matters because fixed large models exceed edge constraints while fixed small models lose accuracy on complex inputs, and the task-dependent nature of size reduction allows dynamic choices to improve overall performance. Experiments on zero-shot classification and open-vocabulary segmentation confirm gains over static baselines.

Core claim

The central claim is that the performance impact of model size reduction varies by task and scene in vision applications, so a runtime-adaptive execution strategy can maintain high accuracy while cutting average computation. AdaVFM embeds neural architecture search into the vision foundation model backbone to produce executable subnets of different sizes. A multimodal LLM agent deployed on the cloud provides context-aware control to select the right subnet during inference, enabling efficient adaptation across conditions.
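A minimal sketch of what such a runtime selection step could look like, assuming a three-member subnet family and a difficulty score derived from the number of candidate classes. The subnet names, costs, capacity levels, and class-count thresholds are all hypothetical and are not taken from the paper; the toy `difficulty_from_context` function merely stands in for the cloud agent.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    name: str
    gflops: float   # hypothetical compute cost
    capacity: int   # hypothetical: hardest difficulty level (1..3) it handles well


# Hypothetical subnet family; the paper's actual search space is not given here.
SUBNETS = [
    Subnet("small", 1.0, 1),
    Subnet("medium", 3.0, 2),
    Subnet("large", 9.0, 3),
]


def difficulty_from_context(num_candidate_classes: int) -> int:
    """Toy stand-in for the cloud agent: more candidate classes means a harder task."""
    if num_candidate_classes <= 10:
        return 1
    if num_candidate_classes <= 50:
        return 2
    return 3


def select_subnet(difficulty: int, subnets=SUBNETS) -> Subnet:
    """Return the cheapest subnet whose capacity covers the requested difficulty."""
    for subnet in sorted(subnets, key=lambda s: s.gflops):
        if subnet.capacity >= difficulty:
            return subnet
    # Nothing is rated for this difficulty: fall back to the most capable subnet.
    return max(subnets, key=lambda s: s.capacity)
```

With these made-up thresholds, a 9-class scene maps to the small subnet and a 150-class scene to the large one, which mirrors the 9-class versus 150-class contrast in the Figure 2 caption.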

What carries the argument

The runtime selection of NAS-derived subnets in the language-aligned VFM backbone, guided by a multimodal LLM agent for context-aware computation scaling.

If this is right

Surpasses prior adaptive and static methods by up to 7.9% top-1 accuracy on ImageNet-1K for models of comparable size.
Delivers up to 5.2% higher mean IoU on ADE20K for segmentation models of similar scale.
Reduces average FLOPs by up to 77.9% while preserving comparable accuracy levels.
Enables practical zero-shot classification and open-vocabulary segmentation under edge latency and power limits.
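To make the "average FLOPs" figure concrete, the sketch below shows how such an average could be computed from per-subnet costs and usage counts. The costs and frame counts are invented for illustration and do not reproduce the paper's 77.9% number.

```python
def average_flops(frames_per_subnet: dict, cost_per_subnet: dict) -> float:
    """Usage-weighted mean cost over all processed frames."""
    total_frames = sum(frames_per_subnet.values())
    total_cost = sum(
        frames * cost_per_subnet[name] for name, frames in frames_per_subnet.items()
    )
    return total_cost / total_frames


def flops_reduction(adaptive_avg: float, static_cost: float) -> float:
    """Fraction of compute saved relative to always running the static model."""
    return 1.0 - adaptive_avg / static_cost


# Hypothetical costs (GFLOPs) and usage counts; not values from the paper.
COST = {"small": 1.0, "medium": 3.0, "large": 9.0}
USAGE = {"small": 70, "medium": 20, "large": 10}

avg = average_flops(USAGE, COST)              # (70*1 + 20*3 + 10*9) / 100 = 2.2
saving = flops_reduction(avg, COST["large"])  # 1 - 2.2/9, roughly 0.756
```

The reduction depends entirely on how often the cheap subnets are chosen, so a reported average is only meaningful alongside the input distribution it was measured on.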

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The cloud-edge split with LLM control could extend to other foundation models where input difficulty varies across samples.
Runtime adaptation may lower average energy use in continuous mobile operation beyond what static compression achieves.
End-to-end training of the selection agent with the vision subnets might further tighten the accuracy-efficiency curve.

Load-bearing premise

The accuracy loss from using smaller model variants varies enough by scene and task that dynamic selection yields a better overall trade-off than any fixed size.

What would settle it

A controlled test on inputs where accuracy degradation from model compression is identical regardless of scene complexity or task difficulty, showing no benefit from adaptation over the best static model.
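The logic of this test can be illustrated with a toy calculation that compares context-aware routing against a context-blind mix using the same subnet usage fractions, and therefore the same average cost. All accuracy tables below are hypothetical; the point is only that the gain from selection vanishes when the small-versus-large gap is the same for every scene type.

```python
def adaptive_accuracy(acc, scene_mix, policy):
    """Accuracy when each scene type is routed to the subnet named by `policy`."""
    return sum(frac * acc[policy[scene]][scene] for scene, frac in scene_mix.items())


def blind_accuracy(acc, scene_mix, policy):
    """Accuracy when the same subnet usage fractions are applied regardless of
    scene type (a context-blind mix with identical average cost)."""
    usage = {}
    for scene, frac in scene_mix.items():
        usage[policy[scene]] = usage.get(policy[scene], 0.0) + frac
    overall = {
        net: sum(frac * acc[net][scene] for scene, frac in scene_mix.items())
        for net in usage
    }
    return sum(usage[net] * overall[net] for net in usage)


def selection_gain(acc, scene_mix, policy):
    return adaptive_accuracy(acc, scene_mix, policy) - blind_accuracy(acc, scene_mix, policy)


# Hypothetical numbers, chosen only to illustrate the argument.
SCENE_MIX = {"easy": 0.5, "hard": 0.5}
POLICY = {"easy": "small", "hard": "large"}
TASK_DEPENDENT = {
    "small": {"easy": 0.80, "hard": 0.50},
    "large": {"easy": 0.82, "hard": 0.78},
}
UNIFORM = {
    "small": {"easy": 0.70, "hard": 0.70},
    "large": {"easy": 0.80, "hard": 0.80},
}
```

Under the task-dependent table the routed policy beats the blind mix by about 6.5 points; under the uniform table the two coincide, which is the outcome that would undercut the premise.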

Figures

Figures reproduced from arXiv: 2604.15622 by Barbara De Salvo, Chiao Liu, Cijo Jose, Huapeng Su, Jieyu Lin, Michael Ramamonjisoa, Patrick Labatut, Phillip B. Gibbons, Stefano Ambrogio, Yiwei Zhao, Yi Zheng, Ziyun Li.

**Figure 1.** Left (a): Always-on smart glasses with on-device VFM. Right (b): End-to-end mIoU on open-vocabulary ADE20K segmentation [77]. Our design significantly improves mIoU by up to 5.2% and reduces FLOPs by up to 77.9% over prior models. Deploying vision foundation models (VFMs) on edge devices, however, remains challenging, and enabling language-aligned VFMs is even more difficult. These models are large, comp…

**Figure 2.** Open-vocabulary segmentation on ADE20K [77] using 50M and 300M-parameter VFMs (300M: DINO.txt [29]; 50M: similarly distilled and fine-tuned). Results are shown for both the original 150-class and the grouped 9-class setting (grouping based on WordNet [18], described in § 6.2). Left: Both models perform similarly on the simpler 9-class task, but in the complex 150-class setting, VFM-300M yields clear object b…

**Figure 3.** Overview of the proposed AdaVFM. Left: Edge-side execution, where the adaptive Vision Encoder follows cloud agent instructions to select an efficient execution scheme and perform vision-text contrastive inference. Right: Cloud-side execution, where the agent uses scene/context information to generate semantic understanding for the Text Encoder and execution guidance for the Vision Encoder.

**Figure 4.** Left (a): Operation flow: LLM-based runtime management agent. Right (b): Overview of the training pipeline. The vision backbone is first distilled from a foundation model (DINOv2 [49]), followed by vision-text alignment using CLIP. Both stages employ NAS and sandwich sampling [70]. From § 3.3 (LLM-Guided Efficient Runtime Execution): A central component of our system is the LLM-guided runtime management agent, invo…

**Figure 5.** Image only; no caption available.

**Figure 6.** Left (a): Impact of the LLM runtime agent on open-vocabulary segmentation on ADE20K [77]. NAS uses text-aligned subnets without the LLM agent; AdaVFM w/o Subnet Selection uses LLM only for semantic class filtering; AdaVFM uses the LLM for both semantic class filtering and adaptive subnet selection. Efficiency gains show that runtime adaptive selection is critical. Right (b): Subnet selection across α. Larg…

**Figure 1 (appendix).** Our ARM Ethos-U55 test silicon. From Appendix B (Architecture of Basic Blocks), B.1 ConvNeXt-v2 Blocks: We use ConvNeXt-v2 [62] with selective capacity as the core building block of our model, as shown in Fig. 2a. All block widths (dims) are selectable during training and runtime, enabling the adaptive behavior of our model. We also replace GELU with ReLU for better compatibility and efficiency on edge devices.

**Figure 2 (appendix).** (a) Left: Selective ConvNeXt-v2 blocks. (b) Right: Downsample layers. The block widths (dims) are selectable during training and runtime. Downsample Layer 1 uses two consecutive 3 × 3-Conv2D layers with stride 2, yielding an effective downsampling factor of 4. Downsample Layer 2 uses a single 3 × 3-Conv2D layer with stride 2. Downsample Layers 3 and 4 instead apply 1 × 1-Conv2D layers with stride 2.

**Figure 3 (appendix).** Left (a): End-to-end trade-off between mIoU on open-vocabulary ADE20K segmentation [77] and average execution latency. Right (b): End-to-end trade-off between mIoU and average energy consumption. From Appendix E (End-to-End Accuracy-Efficiency Trade-offs with Additional Metrics): In the main paper, we present the accuracy-efficiency trade-off (Fig. 1b in the main paper) using FLOPs. Here, we additionally report results ba…
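The downsample layers described in the appendix caption can be checked with the standard convolution output-size formula. The sketch below assumes padding 1 for the 3 × 3 convolutions, padding 0 for the 1 × 1 convolutions, and a 224-pixel input; none of these three values is stated in the captions, so treat them as illustrative.

```python
def conv_out(size: int, kernel: int, stride: int, padding: int) -> int:
    """Spatial output size of a 2D convolution along one dimension."""
    return (size + 2 * padding - kernel) // stride + 1


# (kernel, stride, padding) for each conv in each downsample layer.
# Kernel sizes and strides follow the caption; paddings are assumptions.
DOWNSAMPLE_LAYERS = [
    [(3, 2, 1), (3, 2, 1)],  # Layer 1: two 3x3 stride-2 convs, overall factor 4
    [(3, 2, 1)],             # Layer 2: one 3x3 stride-2 conv
    [(1, 2, 0)],             # Layer 3: one 1x1 stride-2 conv
    [(1, 2, 0)],             # Layer 4: one 1x1 stride-2 conv
]


def feature_map_sizes(input_size: int) -> list:
    """Spatial size after each downsample layer."""
    sizes = []
    size = input_size
    for layer in DOWNSAMPLE_LAYERS:
        for kernel, stride, padding in layer:
            size = conv_out(size, kernel, stride, padding)
        sizes.append(size)
    return sizes


sizes = feature_map_sizes(224)  # [56, 28, 14, 7] under the assumed paddings
```

Under these assumptions the four stages reduce the input by a total factor of 32 (4 × 2 × 2 × 2).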

Original abstract

Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to $7.9\%$ in acc@1 on IN1K and $5.2\%$ mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to $77.9\%$.
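The figure captions describe the edge side as vision-text contrastive inference, with the LLM optionally filtering the semantic classes first. A minimal sketch of that idea follows, using cosine similarity over toy two-dimensional embeddings. The embeddings, class names, and the `allowed` argument are hypothetical; a real system would take embeddings from the vision and text encoders.

```python
import math


def cosine(a, b) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, text_embs: dict, allowed=None) -> str:
    """Pick the class whose text embedding is most similar to the image embedding.

    `allowed` optionally restricts the candidates, standing in for the
    semantic class filtering attributed to the LLM agent.
    """
    candidates = list(text_embs) if allowed is None else list(allowed)
    return max(candidates, key=lambda name: cosine(image_emb, text_embs[name]))


# Toy embeddings for illustration only.
TEXT_EMBS = {"sofa": [1.0, 0.0], "bench": [0.9, 0.4], "car": [0.0, 1.0]}
IMAGE_EMB = [0.8, 0.5]

unfiltered = zero_shot_predict(IMAGE_EMB, TEXT_EMBS)
filtered = zero_shot_predict(IMAGE_EMB, TEXT_EMBS, allowed=["sofa", "car"])
```

In this toy case the unfiltered prediction is "bench", while restricting candidates to classes plausible for the scene changes it to "sofa". A smaller candidate set also means fewer similarity comparisons, which is one way filtering can make an easier task for a smaller subnet.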

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

AdaVFM pairs NAS-derived subnets in language-aligned vision models with a cloud LLM for runtime selection, aiming at better edge trade-offs, but the abstract gives almost no experimental detail to judge the numbers.

The main contribution is a runtime-adaptive system that searches lightweight subnets inside a vision foundation model backbone and uses a multimodal LLM on the cloud to pick which one to run based on scene context and task. The motivation is that model compression hurts performance differently across vision tasks, so static pruning is suboptimal for edge devices. That framing is straightforward and practical for always-on contextual AI.

Referee Report

2 major / 2 minor

Summary. The paper proposes AdaVFM, an adaptive framework for on-device inference of language-aligned vision foundation models (VFMs). It integrates neural architecture search (NAS) into the VFM backbone to enable runtime execution of lightweight subnets and uses a cloud-deployed multimodal LLM as a context-aware agent to dynamically select the execution path based on scene context and task complexity. The central claim is that this approach exploits the task-dependent impact of model size reduction to achieve superior accuracy-efficiency trade-offs, with reported gains of up to 7.9% top-1 accuracy on ImageNet-1K zero-shot classification and 5.2% mIoU on ADE20K open-vocabulary segmentation, plus up to 77.9% average FLOPs reduction for comparable accuracy.

Significance. If the experimental results are reproducible and the adaptive mechanism is shown to be the primary driver, this could meaningfully advance edge deployment of large VFMs by offering a practical runtime adaptation strategy without retraining. The cross-modal use of an LLM agent for control is a notable design choice that could generalize to other adaptive inference settings.
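The Figure 4 caption says both training stages use NAS with sandwich sampling [70]. As that term is commonly used for weight-sharing supernets, each training step updates the largest subnet, the smallest subnet, and a few randomly drawn ones. The sketch below shows only the sampling step under that reading; the width list and `num_random` default are hypothetical, and the paper's exact procedure may differ.

```python
import random


def sandwich_sample(widths, num_random: int = 2, rng=None):
    """Return the subnet widths to train in one step:
    the largest, the smallest, then `num_random` randomly chosen widths."""
    rng = rng or random.Random()
    picks = [max(widths), min(widths)]
    picks += [rng.choice(widths) for _ in range(num_random)]
    return picks


# Hypothetical width multipliers for one selectable block.
WIDTHS = [0.25, 0.5, 0.75, 1.0]
step_widths = sandwich_sample(WIDTHS, num_random=2, rng=random.Random(0))
```

Always including both extremes is meant to bound the performance of every intermediate subnet, which matters here because the runtime agent may pick any of them.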

major comments (2)

[Abstract and §4 (Experiments)] The stated improvements (7.9% acc@1 on IN1K, 5.2% mIoU on ADE20K, 77.9% FLOPs reduction) are presented without any description of baselines, model sizes compared, number of runs, variance, or statistical tests. This directly undermines evaluation of the central accuracy-efficiency claim.
[§3 (Method)] The motivating assumption that 'the effect of model size reduction on performance is task-dependent' is used to justify the entire adaptive NAS+LLM design, yet no controlled ablation or analysis is referenced showing how performance degradation varies across tasks/scenes to support the runtime selection policy.
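The kind of reporting the first comment asks for can be sketched with the Python standard library: mean and sample standard deviation per method, plus a paired t statistic over matched runs. The function names are illustrative, and any numbers fed to them here would be placeholders rather than results from the paper.

```python
import math
import statistics


def summarize(runs):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(runs), statistics.stdev(runs)


def paired_t_statistic(method_a, method_b) -> float:
    """t statistic for paired runs (same seeds or splits for both methods).

    Compare against a t distribution with len(method_a) - 1 degrees of freedom.
    """
    diffs = [a - b for a, b in zip(method_a, method_b)]
    n = len(diffs)
    return statistics.mean(diffs) / (statistics.stdev(diffs) / math.sqrt(n))


# Placeholder inputs purely to show the call shape.
mean_acc, std_acc = summarize([1.0, 2.0, 3.0])
t_value = paired_t_statistic([2.0, 3.0, 5.0], [1.0, 1.0, 2.0])
```

With only three to five runs the test has little power, so reporting the per-run differences alongside the statistic is usually more informative than a p-value alone.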

minor comments (2)

[§3.3] Clarify the exact interface between the on-device NAS subnets and the cloud LLM agent, including latency overhead of the control loop and any assumptions about network connectivity.
Ensure consistent terminology for 'subnet' vs. 'model size' throughout; the NAS integration description would benefit from a diagram of the searchable space.
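The latency question in the first minor comment can be framed with a simple amortization model. The sketch assumes the cloud agent is consulted once every `frames_per_update` frames and that its round-trip cost is spread evenly across them. This model and all parameter names are assumptions for illustration; the paper's actual control loop is not described in the material above.

```python
def amortized_latency_ms(
    edge_ms: float, cloud_round_trip_ms: float, frames_per_update: int
) -> float:
    """Per-frame latency when one cloud call is amortized over several frames."""
    if frames_per_update < 1:
        raise ValueError("frames_per_update must be at least 1")
    return edge_ms + cloud_round_trip_ms / frames_per_update


def plan_or_default(cloud_plan, default_subnet: str) -> str:
    """Fall back to a fixed subnet when no cloud guidance is available."""
    return cloud_plan if cloud_plan is not None else default_subnet


# Hypothetical numbers: 10 ms on-device inference, 300 ms round trip,
# guidance refreshed every 30 frames.
per_frame = amortized_latency_ms(10.0, 300.0, 30)  # 10 + 300/30 = 20.0 ms
```

Under this model the overhead shrinks as guidance is refreshed less often, at the price of slower reaction to scene changes, and a fallback subnet is needed whenever connectivity drops.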

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying our experimental setup where possible and committing to revisions that strengthen the presentation of results and the justification for our design choices.

Referee: [Abstract and §4 (Experiments)] The stated improvements (7.9% acc@1 on IN1K, 5.2% mIoU on ADE20K, 77.9% FLOPs reduction) are presented without any description of baselines, model sizes compared, number of runs, variance, or statistical tests. This directly undermines evaluation of the central accuracy-efficiency claim.

Authors: We agree that the abstract and §4 would benefit from greater explicitness to allow full evaluation of the claims. The manuscript already compares against fixed-size VFM baselines (e.g., CLIP-ViT variants, BLIP, and prior NAS methods) of comparable parameter counts and FLOPs, with the 7.9% and 5.2% figures representing the maximum observed gains over the strongest such baseline at each operating point. In the revised version we will (i) expand the abstract and §4 to list the exact baseline models and their sizes, (ii) report results averaged over 3–5 runs with standard deviations, and (iii) add paired statistical significance tests for the key accuracy and FLOPs differences. These additions will be placed in a new “Evaluation Protocol” subsection of §4.

Revision: yes

Referee: [§3 (Method)] The motivating assumption that 'the effect of model size reduction on performance is task-dependent' is used to justify the entire adaptive NAS+LLM design, yet no controlled ablation or analysis is referenced showing how performance degradation varies across tasks/scenes to support the runtime selection policy.

Authors: The core insight is indeed that performance sensitivity to model size varies with scene complexity and task type; this is what enables the LLM agent to select subnets profitably at runtime. While §4 already shows that AdaVFM outperforms fixed-size models on two distinct tasks (zero-shot classification and open-vocabulary segmentation), we acknowledge that a more targeted, controlled demonstration of the variation itself would strengthen the motivation. In the revised manuscript we will add a dedicated ablation (new Figure or subsection in §3 or §4) that measures accuracy degradation for the same set of subnets across controlled subsets of ImageNet and ADE20K stratified by scene complexity (e.g., object density, lighting variation). This will directly illustrate the task/scene dependence that justifies the adaptive policy.

Revision: yes
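The stratified ablation proposed in the second response amounts to tabulating accuracy per (stratum, subnet) pair and comparing the large-minus-small gap across strata. A minimal sketch follows, assuming each evaluation record is a `(stratum, subnet, correct)` tuple; that record format and the stratum names are hypothetical.

```python
from collections import defaultdict


def accuracy_by_stratum(records):
    """Map (stratum, subnet) to accuracy from (stratum, subnet, correct) records."""
    hits = defaultdict(int)
    totals = defaultdict(int)
    for stratum, subnet, correct in records:
        totals[(stratum, subnet)] += 1
        hits[(stratum, subnet)] += int(correct)
    return {key: hits[key] / totals[key] for key in totals}


def size_sensitivity(acc, small: str, large: str):
    """Accuracy gap (large minus small) per stratum; a flat profile would
    argue against scene-dependent selection."""
    strata = {stratum for stratum, _ in acc}
    return {s: acc[(s, large)] - acc[(s, small)] for s in strata}


# Synthetic records for illustration only.
RECORDS = [
    ("simple", "small", True), ("simple", "small", True),
    ("simple", "large", True), ("simple", "large", True),
    ("dense", "small", False), ("dense", "small", True),
    ("dense", "large", True), ("dense", "large", True),
]
gap = size_sensitivity(accuracy_by_stratum(RECORDS), "small", "large")
```

In this synthetic example the gap is zero on "simple" scenes and 0.5 on "dense" ones, which is the qualitative pattern the adaptive policy depends on.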

Circularity Check

0 steps flagged

No circularity in derivation chain

The paper describes an engineering framework for adaptive VFM inference using NAS and cloud LLM control, motivated by the empirical observation that model size reduction effects are task-dependent. No equations, first-principles derivations, or predictions are presented that reduce to inputs by construction. All performance claims rest on external benchmark comparisons (IN1K, ADE20K) against prior baselines, with no self-citation load-bearing steps or fitted parameters renamed as predictions. The derivation chain is self-contained against external experimental evidence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies insufficient technical detail to enumerate free parameters, axioms, or invented entities; the approach appears to rest on standard NAS and LLM components whose concrete realizations are not specified.

pith-pipeline@v0.9.0 · 5568 in / 1207 out tokens · 61214 ms · 2026-05-10T08:34:04.739899+00:00 · methodology

Reference graph

Works this paper leans on

87 extracted references · 87 canonical work pages · 1 internal anchor

[1] Meta ray-ban smart glasses. https://www.meta.com/ai-glasses/ray-ban-meta/ (2023), a series of AI-enabled smart glasses combining camera, audio, and voice-controlled Meta AI features, developed by Meta Platforms in partnership with Ray-Ban
[2] Abbaspourazad, S., Elachqar, O., Miller, A., Emrani, S., Nallasamy, U., Shapiro, I.: Large-scale training of foundation models for wearable biosignals. In: The Twelfth International Conference on Learning Representations (2024)
[3] Abrash, M.: Creating the future: Augmented reality, the next human-machine interface. In: 2021 IEEE International Electron Devices Meeting (IEDM). pp. 1–11 (2021). https://doi.org/10.1109/IEDM19574.2021.9720526
[4] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millican, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo:... In: Proceedings of the 36th International Conference on Neural Information Processing Systems (2022)
[5] Arm®: Arm ethos-u55 micronpu description. https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55 (Accessed 2026-03)
[6] Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: European Conference on Computer Vision (2014)
[7] Brun, D.: Multimodal and context-aware interaction in augmented reality for active assistance. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. pp. 506–510 (2018)
[8] Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once-for-all: Train one network and specialize it for efficient deployment. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=HylxE1HKwS
[9] Chen, J., Hu, J., Wang, G., Jiang, Z., Zhou, T., Chen, Z., Lv, C.: Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10723–10734 (June 2025)
[10] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: The Eleventh International Conference on Learning Representations (2023)
[11] Chu, X., Zhang, B., Xu, R.: Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. In: Proceedings of the IEEE/CVF International Conference on computer vision. pp. 12239–12248 (2021)
[12] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2014)
[13] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
[14] Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., Florence, P.: Palm-e: an embodied multimodal language model. In: Proceedings of the 40th International Conference on Machine Learning... (2023)
[15] Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV, pp. 75–92. Springer Nature Switzerland, Cham (2025)
[17] Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A.T., Shankar, V.: Data filtering networks. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=KAk6ngZ09F
[18] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding 106(1), 59–70 (2007)
[19] Fellbaum, C.: WordNet: An electronic lexical database. MIT press (1998)
[20] Fundamental AI Research, M.: Introducing llama 4: Advancing multimodal intelligence. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ (2024), accessed April 5, 2025
[21] Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Fei-Fei, L.: Fine-grained car detection for visual census estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 31 (2017)
[22] Georg, M., Tanzer, G., Uboweja, E., Hassan, S., Shengelia, M., Sepah, S., Forbes, S., Starner, T.: Fsboard: Over 3 million characters of asl fingerspelling collected via smartphones. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13897–13906 (June 2025)
[23] Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling open-vocabulary image segmentation with image-level labels. In: European conference on computer vision. pp. 540–557. Springer (2022)
[24] Goswami, R.G., Krishnamurthy, P., LeCun, Y., Khorrami, F.: Robopepp: Vision-based robot pose and joint angle estimation through embedding predictive pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6930–6939 (June 2025)
[25] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
[26] Hoang, M.L.: A comprehensive review of machine learning, and deep learning in wearable iot devices. IEEE Access (2025)
[27] Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L.C., Tan, M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R., Adam, H., Le, Q.: Searching for mobilenetv3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1314–1324 (2019). https://doi.org/10.1109/ICCV.2019.00140
[28] Ilharco, G., Wortsman, M., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., et al.: Openclip. Zenodo (2021)
[29] Imam, M.F., Marew, R.F., Hassan, J., Fiaz, M., Aji, A.F., Cholakkal, H.: Clip meets dino for tuning zero-shot classifier using unlabeled image collections. arXiv preprint arXiv:2411.19346 (2024)
[30] Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al.: Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24905–24916 (2025)
[31] Kavukcuoglu, K., Pichai, S., Hassabis, D., Walker, K., Manyika, J., Porat, R.: Gemini-2.5: Our most intelligent ai model. Google DeepMind Blog (Mar 2025)
[32] Khattak, M.U., Naeem, M.F., Naseer, M., Van Gool, L., Tombari, F.: Learning to prompt with text only supervision for vision-language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4230–4238 (2025)
[33] Kim, S.Y., Chung, D.o., Lee, K., Lee, C., Huh, J.: Low-power always-on camera (aoc) system with workload offloading to cmos image sensor. In: 2023 IEEE International Conference on Consumer Electronics (ICCE). pp. 1–2. IEEE (2023)
[34] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
[35] Lane, N.D., Bhattacharya, S., Georgiev, P., Forlivesi, C., Jiao, L., Qendro, L., Kawsar, F.: Deepx: A software accelerator for low-power deep learning inference on mobile devices. In: 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). pp. 1–12 (2016). https://doi.org/10.1109/IPSN.2016.7460664
[36] Lee, J., Joo, H.: Mocap everyone everywhere: Lightweight motion capture with smartwatches and a head-mounted camera. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1091–1100 (2024)
[37] Li, C., Tang, T., Wang, G., Peng, J., Wang, B., Liang, X., Chang, X.: Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12281–12291 (2021)
[38] Li, E., Zhou, Z., Chen, X.: Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In: Proceedings of the 2018 Workshop on Mobile Edge Communications. p. 31–36. MECOMM'18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3229556.3229562
[39] Li, J.N., Zhang, Z.J., Ma, J.: Omniquery: Contextually augmenting captured multimodal memories to enable personal question answering. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. CHI '25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3706598.3713448
[40] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. ICML'23, JMLR.org (2023)
[41] Lin, R., Weng, P., Wang, Y., Ding, H., Han, J., Wang, F.: Hilots: High-low temporal sensitive representation learning for semi-supervised lidar segmentation in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1429–1438 (June 2025)
[42] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)
[43] Lu, Z., Sreekumar, G., Goodman, E., Banzhaf, W., Deb, K., Boddeti, V.N.: Neural architecture transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(9), 2971–2989 (2021). https://doi.org/10.1109/TPAMI.2021.3052758
[44] Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. Tech. rep. (2013)
[45] Mehta, S., Rastegari, M.: Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: International Conference on Learning Representations (2022)
[46] Mirza, M.J., Karlinsky, L., Lin, W., Possegger, H., Kozinski, M., Feris, R., Bischof, H.: Lafter: Label-free tuning of zero-shot classifier using language and unlabeled image collections. Advances in Neural Information Processing Systems 36, 5765–5777 (2023)
[47] Moon, G., Weipeng, X., Joshi, R., Chenglei, W., Shiratori, T.: Authentic hand avatar from a phone scan via universal hand model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2029–2038 (2024). https://doi.org/10.1109/CVPR52733.2024.00198
[48]

IEEE Transactions on Circuits and Systems II: Express Briefs 68(9), 3078–3082 (2021)

Nazhamaiti, M., Xu, H., Liu, Z., Chen, Y., Wei, Q., Wu, X., Qiao, F.: Ns-md: near-sensor motion detection with energy harvesting image sensor for always-on visual perception. IEEE Transactions on Circuits and Systems II: Express Briefs 68(9), 3078–3082 (2021)

work page 2021
[49]

OpenAI: Introducing gpt-5.https://openai.com/index/introducing- gpt- 5/ (August 2025),https://openai.com/index/introducing-gpt-5/, large language model

work page 2025
[50]

Transactions on Machine Learning Research Journal pp

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)

[51]

Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3498–3505. IEEE (2012)

[52]

Raaen, K., Kjellmo, I.: Measuring latency in virtual reality systems. In: Entertainment Computing - ICEC 2015. p. 457–462. Springer-Verlag, Berlin, Heidelberg (2022). https://doi.org/10.1007/978-3-319-24589-8_40

[53]

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

[54]

Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., Kurakin, A.: Large-scale evolution of image classifiers. In: International conference on machine learning. pp. 2902–2911. PMLR (2017)

[55]

Roth, K., Kim, J.M., Koepke, A., Vinyals, O., Schmid, C., Akata, Z.: Waffling around for performance: Visual classification with random words and broad concepts. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 15746–15757 (2023)

[56]

Seneviratne, S., Hu, Y., Nguyen, T., Lan, G., Khalifa, S., Thilakarathna, K., Hassan, M., Seneviratne, A.: A survey of wearable devices and challenges. IEEE Communications Surveys & Tutorials 19(4), 2573–2620 (2017)

[57]

Serianni, A., Kalita, J.: Training-free neural architecture search for RNNs and transformers. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2522–2540. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.18653...

[58]

Shrivastava, A., Selvaraju, R.R., Naik, N., Ordonez, V.: Clip-lite: Information efficient visual representation learning with language supervision. In: International Conference on Artificial Intelligence and Statistics. pp. 8433–8447. PMLR (2023)

[59]

Skillman, A., Edsö, T.: A technical overview of cortex-m55 and ethos-u55: Arm’s most capable processors for endpoint ai. In: 2020 IEEE Hot Chips 32 Symposium (HCS). pp. 1–20 (2020). https://doi.org/10.1109/HCS49909.2020.9220415

[60]

Stauffert, J.P., Niebling, F., Latoschik, M.E.: Reducing application-stage latencies for real-time interactive systems. In: 2016 IEEE 9th Workshop on Software Engineering and Architectures for Realtime Interactive Systems (SEARIS). pp. 1–7. IEEE (2016)

[61]

Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

[62]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

[63]

Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16133–16142 (2023)

[64]

Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., et al.: Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21970–21980 (2023)

[65]

Xia, W., Feng, R., Wang, D., Hu, D.: Phoenix: A motion-based self-reflection framework for fine-grained robotic action correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6981–6990 (June 2025)

[66]

Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition. pp. 3485–3492. IEEE (2010)

[67]

Xing, Y., Kang, J., Xiao, A., Nie, J., Shao, L., Lu, S.: Rewrite caption semantics: Bridging semantic gaps for language-supervised semantic segmentation. Advances in Neural Information Processing Systems 36, 68798–68809 (2023)

[68]

Xing, Z., Zhang, X., Hu, Y., Jiang, B., He, T., Zhang, Q., Long, X., Yin, W.: Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1602–1611 (June 2025)

[69]

Xu, H., Xie, S., Tan, X., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP data. In: The Twelfth International Conference on Learning Representations (2024)

[70]

Xu, J., Hou, J., Zhang, Y., Feng, R., Wang, Y., Qiao, Y., Xie, W.: Learning open-vocabulary semantic segmentation models from natural language supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2935–2944 (2023)

[71]

Yu, J., Huang, T.S.: Universally slimmable networks and improved training techniques. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1803–1811 (2019)

[72]

Yu, J., Jin, P., Liu, H., Bender, G., Kindermans, P.J., Tan, M., Huang, T., Song, X., Pang, R., Le, Q.: Bignas: Scaling up neural architecture search with big single-stage models. In: Computer Vision–ECCV 2020: 16th European Conference, Part VII 16. pp. 702–717. Springer (2020)

[73]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

[74]

Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)

[75]

Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems 35, 36067–36080 (2022)

[76]

Zhao, Y., Chen, J., Zhang, S.Q., Sarwar, S.S., Stangherlin, K.H., Gomez, J.T., Seo, J.S., De Salvo, B., Liu, C., Gibbons, P.B., Li, Z.: H4h: Hybrid convolution-transformer architecture search for npu-cim heterogeneous systems for ar/vr applications. In: Proceedings of the 30th Asia and South Pacific Design Automation Conference. p. 1133–1141. ASPDAC ’2...

[77]

Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

[78]

Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127(3), 302–321 (2019)

[79]

Zhou, J., Yu, X., Luo, P., et al.: ibot: Image bert pre-training with online tokenizer. In: International Conference on Learning Representations (ICLR) (2022)

[80]

Zoph, B., Le, Q.: Neural architecture search with reinforcement learning. In: International Conference on Learning Representations (2016)

AdaVFM: Supplementary Material

A Hardware Platform and Evaluation Setup

We adopt the ARM Ethos-U55 [5,58] as a representative edge Neural Processing Unit (NPU). The test silicon (Fig. 1) is fabricated in 7nm FinFET an...



[44] [44]

Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. Tech. rep. (2013)

work page 2013

[45] [45]

In: International Conference on Learning Representa- tions (2022)

Mehta, S., Rastegari, M.: Mobilevit: Light-weight, general-purpose, and mobile- friendly vision transformer. In: International Conference on Learning Representa- tions (2022)

work page 2022

[46] [46]

Advances in Neural Information Processing Systems36, 5765– 5777 (2023)

Mirza, M.J., Karlinsky, L., Lin, W., Possegger, H., Kozinski, M., Feris, R., Bischof, H.: Lafter: Label-free tuning of zero-shot classifier using language and unlabeled image collections. Advances in Neural Information Processing Systems36, 5765– 5777 (2023)

work page 2023

[47] [47]

Emogen: Emotional image content generation with text-to-image diffusion models,

Moon, G., Weipeng, X., Joshi, R., Chenglei, W., Shiratori, T.: Authentic hand avatar from a phone scan via universal hand model. In: 2024 IEEE/CVF Confer- ence on Computer Vision and Pattern Recognition (CVPR). pp. 2029–2038 (2024). https://doi.org/10.1109/CVPR52733.2024.00198

work page doi:10.1109/cvpr52733.2024.00198 2024

[48] [48]

IEEE Transactions on Circuits and Systems II: Express Briefs 68(9), 3078–3082 (2021)

Nazhamaiti, M., Xu, H., Liu, Z., Chen, Y., Wei, Q., Wu, X., Qiao, F.: Ns-md: near-sensor motion detection with energy harvesting image sensor for always-on visual perception. IEEE Transactions on Circuits and Systems II: Express Briefs 68(9), 3078–3082 (2021)

work page 2021

[49] [49]

OpenAI: Introducing gpt-5.https://openai.com/index/introducing- gpt- 5/ (August 2025),https://openai.com/index/introducing-gpt-5/, large language model

work page 2025

[50] [50]

Transactions on Machine Learning Research Journal pp

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)

work page 2024

[51] [51]

In: 2012 IEEE conference on computer vision and pattern recognition

Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3498–3505. IEEE (2012)

work page 2012

[52] [52]

In: Proc

Raaen, K., Kjellmo, I.: Measuring latency in virtual reality systems. In: En- tertainment Computing - ICEC 2015. p. 457–462. Springer-Verlag, Berlin, Hei- delberg (2022).https://doi.org/10.1007/978- 3- 319- 24589- 8_40,https: //doi.org/10.1007/978-3-319-24589-8_40

work page doi:10.1007/978- 2015

[53] [53]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PmLR (2021)

work page 2021

[54] [54]

In: International conference on machine learning

Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., Ku- rakin, A.: Large-scale evolution of image classifiers. In: International conference on machine learning. pp. 2902–2911. PMLR (2017)

work page 2017

[55] [55]

In: Proceedings of the IEEE/CVF international conference on computer vision

Roth, K., Kim, J.M., Koepke, A., Vinyals, O., Schmid, C., Akata, Z.: Waffling around for performance: Visual classification with random words and broad con- cepts. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 15746–15757 (2023)

work page 2023

[56] [56]

IEEE Com- munications Surveys & Tutorials19(4), 2573–2620 (2017) AdaVFM 19

Seneviratne, S., Hu, Y., Nguyen, T., Lan, G., Khalifa, S., Thilakarathna, K., Has- san, M., Seneviratne, A.: A survey of wearable devices and challenges. IEEE Com- munications Surveys & Tutorials19(4), 2573–2620 (2017) AdaVFM 19

work page 2017

[57] [57]

Serianni, A., Kalita, J.: Training-free neural architecture search for RNNs and transformers.In:Rogers,A.,Boyd-Graber,J.,Okazaki,N.(eds.)Proceedingsofthe 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2522–2540. Association for Computational Linguistics, Toronto, Canada (Jul 2023).https://doi.org/10.18653...

work page doi:10.18653/v1/2023.acl-long.142 2023

[58] [58]

In: International Conference on Artificial Intelligence and Statistics

Shrivastava, A., Selvaraju, R.R., Naik, N., Ordonez, V.: Clip-lite: Information ef- ficient visual representation learning with language supervision. In: International Conference on Artificial Intelligence and Statistics. pp. 8433–8447. PMLR (2023)

work page 2023

[59] [59]

In: 2020 IEEE Hot Chips 32 Symposium (HCS)

Skillman, A., Edsö, T.: A technical overview of cortex-m55 and ethos-u55: Arm’s most capable processors for endpoint ai. In: 2020 IEEE Hot Chips 32 Symposium (HCS). pp. 1–20 (2020).https://doi.org/10.1109/HCS49909.2020.9220415

work page doi:10.1109/hcs49909.2020.9220415 2020

[60] [60]

In: 2016 IEEE 9th Workshop on Software En- gineering and Architectures for Realtime Interactive Systems (SEARIS)

Stauffert, J.P., Niebling, F., Latoschik, M.E.: Reducing application-stage latencies for real-time interactive systems. In: 2016 IEEE 9th Workshop on Software En- gineering and Architectures for Realtime Interactive Systems (SEARIS). pp. 1–7. IEEE (2016)

work page 2016

[61] [61]

In: International conference on machine learning

Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)

work page 2019

[62] [62]

Llama 2: Open Foundation and Fine-Tuned Chat Models

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bash- lykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)

work page internal anchor Pith review Pith/arXiv arXiv 2023

[63] [63]

In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16133– 16142 (2023)

work page 2023

[64] [64]

In: Proceedings of the IEEE/CVF International Conference on Computer Vision

Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., et al.: Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21970–21980 (2023)

work page 2023

[65] [65]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

Xia,W.,Feng,R.,Wang,D.,Hu,D.:Phoenix:Amotion-basedself-reflectionframe- work for fine-grained robotic action correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6981–6990 (June 2025)

work page 2025

[66] [66]

In: 2010 IEEE computer society conference on computer vision and pattern recognition

Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition. pp. 3485–3492. IEEE (2010)

work page 2010

[67] [67]

Advances in Neural Information Processing Systems36, 68798–68809 (2023)

Xing, Y., Kang, J., Xiao, A., Nie, J., Shao, L., Lu, S.: Rewrite caption semantics: Bridging semantic gaps for language-supervised semantic segmentation. Advances in Neural Information Processing Systems 36, 68798–68809 (2023)

[68]

Xing, Z., Zhang, X., Hu, Y., Jiang, B., He, T., Zhang, Q., Long, X., Yin, W.: Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1602–1611 (June 2025)

[69]

Xu, H., Xie, S., Tan, X., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP data. In: The Twelfth International Conference on Learning Representations (2024)

[70]

Xu, J., Hou, J., Zhang, Y., Feng, R., Wang, Y., Qiao, Y., Xie, W.: Learning open-vocabulary semantic segmentation models from natural language supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2935–2944 (2023)

[71]

Yu, J., Huang, T.S.: Universally slimmable networks and improved training techniques. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1803–1811 (2019)

[72]

Yu, J., Jin, P., Liu, H., Bender, G., Kindermans, P.J., Tan, M., Huang, T., Song, X., Pang, R., Le, Q.: Bignas: Scaling up neural architecture search with big single-stage models. In: Computer Vision–ECCV 2020: 16th European Conference, Part VII 16. pp. 702–717. Springer (2020)

[73]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)

[74]

Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)

[75]

Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems 35, 36067–36080 (2022)

[76]

Zhao, Y., Chen, J., Zhang, S.Q., Sarwar, S.S., Stangherlin, K.H., Gomez, J.T., Seo, J.S., De Salvo, B., Liu, C., Gibbons, P.B., Li, Z.: H4h: Hybrid convolution-transformer architecture search for npu-cim heterogeneous systems for ar/vr applications. In: Proceedings of the 30th Asia and South Pacific Design Automation Conference. pp. 1133–1141. ASPDAC ’2...

doi:10.1145/3658617.3697627

[77]

Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)

[78]

Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127(3), 302–321 (2019)

[79]

Zhou, J., Yu, X., Luo, P., et al.: ibot: Image bert pre-training with online tokenizer. In: International Conference on Learning Representations (ICLR) (2022)

[80]

Zoph, B., Le, Q.: Neural architecture search with reinforcement learning. In: International Conference on Learning Representations (2016)

AdaVFM: Supplementary Material

A Hardware Platform and Evaluation Setup

We adopt the ARM Ethos-U55 [5,58] as a representative edge Neural Processing Unit (NPU). The test silicon (Fig. 1) is fabricated in 7nm FinFET an...
