AdaVFM: Adaptive Vision Foundation Models for Edge Intelligence via LLM-Guided Execution
Pith reviewed 2026-05-10 08:34 UTC · model grok-4.3
The pith
AdaVFM dynamically scales vision foundation models at runtime via LLM guidance to improve accuracy-efficiency trade-offs on edge devices.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that the performance impact of model size reduction varies by task and scene in vision applications, so a runtime-adaptive execution strategy can maintain high accuracy while cutting average computation. AdaVFM embeds neural architecture search into the vision foundation model backbone to produce executable subnets of different sizes. A multimodal LLM agent deployed on the cloud provides context-aware control to select the right subnet during inference, enabling efficient adaptation across conditions.
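The selection step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the `Subnet` type, the three subnet entries, their GFLOPs and capacity values, and the idea that the agent emits a scalar difficulty score in [0, 1] are all assumptions made here for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Subnet:
    name: str
    gflops: float    # hypothetical compute cost of this subnet
    capacity: float  # hypothetical max difficulty in [0, 1] it handles well


# Hypothetical subnets; the values are placeholders, not figures from the paper.
SUBNETS = [
    Subnet("small", 1.0, 0.3),
    Subnet("medium", 2.5, 0.7),
    Subnet("large", 5.0, 1.0),
]


def select_subnet(difficulty: float, subnets=SUBNETS) -> Subnet:
    """Return the cheapest subnet whose capacity covers the difficulty score.

    `difficulty` stands in for whatever signal the LLM agent produces;
    it is clamped to [0, 1] before comparison.
    """
    d = min(max(difficulty, 0.0), 1.0)
    for subnet in sorted(subnets, key=lambda s: s.gflops):
        if subnet.capacity >= d:
            return subnet
    # Fall back to the most expensive subnet if nothing covers the score.
    return max(subnets, key=lambda s: s.gflops)
```

The design choice illustrated is "cheapest sufficient subnet": easy inputs run on the small variant and only hard inputs pay for the large one, which is where any average-compute saving would come from.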
What carries the argument
The runtime selection of NAS-derived subnets in the language-aligned VFM backbone, guided by a multimodal LLM agent for context-aware computation scaling.
If this is right
- Surpasses prior baselines by up to 7.9% in top-1 accuracy on ImageNet-1K among models of comparable size.
- Delivers up to 5.2% higher mean IoU on ADE20K for segmentation models of similar scale.
- Reduces average FLOPs by up to 77.9% while preserving comparable accuracy levels.
- Enables practical zero-shot classification and open-vocabulary segmentation under edge latency and power limits.
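The average-FLOPs figure in the list above is a weighted mean over whichever subnets actually run. The sketch below shows that arithmetic only; the function names and the example mix are hypothetical, and none of the numbers come from the paper.

```python
def average_flops(mix):
    """Weighted mean compute for a mix of subnets.

    `mix` is a list of (fraction_of_inputs, gflops) pairs whose
    fractions must sum to 1.
    """
    total = sum(fraction for fraction, _ in mix)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("fractions must sum to 1")
    return sum(fraction * gflops for fraction, gflops in mix)


def reduction_vs_static(mix, static_gflops):
    """Fractional FLOPs saving of the adaptive mix against one fixed model."""
    return 1.0 - average_flops(mix) / static_gflops


# Illustrative mix: half the inputs use a 1-GFLOP subnet, a quarter use
# a 2-GFLOP subnet, and a quarter use the full 4-GFLOP model.
example_mix = [(0.5, 1.0), (0.25, 2.0), (0.25, 4.0)]
```

With this made-up mix the average is 2.0 GFLOPs, a 50% reduction against always running the 4-GFLOP model. The saving depends entirely on how often the small subnets suffice.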
Where Pith is reading between the lines
- The cloud-edge split with LLM control could extend to other foundation models where input difficulty varies across samples.
- Runtime adaptation may lower average energy use in continuous mobile operation beyond what static compression achieves.
- End-to-end training of the selection agent with the vision subnets might further tighten the accuracy-efficiency curve.
Load-bearing premise
The accuracy loss from using smaller model variants varies enough by scene and task that dynamic selection yields a better overall trade-off than any fixed size.
What would settle it
A controlled test on inputs where accuracy degradation from model compression is identical regardless of scene complexity or task difficulty, showing no benefit from adaptation over the best static model.
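One way to operationalise this test is to measure how much the accuracy drop from the largest to the smallest subnet differs across input strata. The sketch below assumes a table of per-stratum accuracies keyed by the hypothetical subnet names "large" and "small"; the strata and all values are invented for illustration.

```python
def degradation_by_stratum(acc):
    """Accuracy drop (large minus small subnet) for each stratum.

    `acc` maps a stratum name to {"large": accuracy, "small": accuracy}.
    """
    return {stratum: a["large"] - a["small"] for stratum, a in acc.items()}


def degradation_spread(acc):
    """Gap between the largest and smallest per-stratum drop.

    A spread near zero means shrinking the model hurts every stratum
    equally, so input-dependent selection has nothing to exploit.
    """
    drops = list(degradation_by_stratum(acc).values())
    return max(drops) - min(drops)


# Invented accuracy tables (percent), for illustration only.
uniform = {
    "simple": {"large": 80.0, "small": 70.0},
    "cluttered": {"large": 60.0, "small": 50.0},
}
varied = {
    "simple": {"large": 80.0, "small": 78.0},
    "cluttered": {"large": 60.0, "small": 40.0},
}
```

In the `uniform` table the spread is zero, which is the settling case described above. In the `varied` table the small subnet is nearly free on simple scenes and costly on cluttered ones, which is the situation the load-bearing premise requires.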
Original abstract
Language-aligned vision foundation models (VFMs) enable versatile visual understanding for always-on contextual AI, but their deployment on edge devices is hindered by strict latency and power constraints. We present AdaVFM, an adaptive framework for efficient on-device inference of language-aligned VFMs that dynamically adjusts computation based on scene context and task complexity. Our key insight is that the effect of model size reduction on performance is task-dependent in vision applications, motivating a runtime-adaptive execution strategy. AdaVFM integrates neural architecture search (NAS) into the language-aligned VFM backbone to enable lightweight subnet execution during runtime. A multimodal large language model (LLM) deployed on the cloud enables runtime control with a context-aware agent. This synergy allows efficient model adaptation under diverse conditions while maintaining strong accuracy. Extensive experiments on zero-shot classification and open-vocabulary segmentation demonstrate that AdaVFM achieves state-of-the-art accuracy-efficiency trade-offs, surpassing prior baselines by up to $7.9\%$ in acc@1 on IN1K and $5.2\%$ mIoU on ADE20K over the best models of comparable VFM sizes. For models with similar accuracy, AdaVFM further reduces average FLOPs by up to $77.9\%$.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes AdaVFM, an adaptive framework for on-device inference of language-aligned vision foundation models (VFMs). It integrates neural architecture search (NAS) into the VFM backbone to enable runtime execution of lightweight subnets and uses a cloud-deployed multimodal LLM as a context-aware agent to dynamically select the execution path based on scene context and task complexity. The central claim is that this approach exploits the task-dependent impact of model size reduction to achieve superior accuracy-efficiency trade-offs, with reported gains of up to 7.9% top-1 accuracy on ImageNet-1K zero-shot classification and 5.2% mIoU on ADE20K open-vocabulary segmentation, plus up to 77.9% average FLOPs reduction for comparable accuracy.
Significance. If the experimental results are reproducible and the adaptive mechanism is shown to be the primary driver, this could meaningfully advance edge deployment of large VFMs by offering a practical runtime adaptation strategy without retraining. The cross-modal use of an LLM agent for control is a notable design choice that could generalize to other adaptive inference settings.
major comments (2)
- [Abstract and §4] Abstract and §4 (Experiments): The stated improvements (7.9% acc@1 on IN1K, 5.2% mIoU on ADE20K, 77.9% FLOPs reduction) are presented without any description of baselines, model sizes compared, number of runs, variance, or statistical tests. This directly undermines evaluation of the central accuracy-efficiency claim.
- [§3] §3 (Method): The motivating assumption that 'the effect of model size reduction on performance is task-dependent' is used to justify the entire adaptive NAS+LLM design, yet no controlled ablation or analysis is referenced showing how performance degradation varies across tasks/scenes to support the runtime selection policy.
minor comments (2)
- [§3.3] Clarify the exact interface between the on-device NAS subnets and the cloud LLM agent, including latency overhead of the control loop and any assumptions about network connectivity.
- Ensure consistent terminology for 'subnet' vs. 'model size' throughout; the NAS integration description would benefit from a diagram of the searchable space.
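The control-loop overhead raised in the first minor comment can be reasoned about with a simple amortisation model. The sketch assumes, purely for illustration, that the cloud agent is consulted once every N frames and that its round-trip time is spread evenly over those frames; the paper as summarised here does not state how often the agent is queried, and all parameter names are hypothetical.

```python
def amortized_overhead_ms(round_trip_ms: float, frames_per_query: int) -> float:
    """Cloud round-trip cost per frame when one query covers several frames."""
    if frames_per_query < 1:
        raise ValueError("frames_per_query must be at least 1")
    return round_trip_ms / frames_per_query


def effective_latency_ms(inference_ms: float, round_trip_ms: float,
                         frames_per_query: int) -> float:
    """On-device inference time plus the amortised control-loop overhead."""
    return inference_ms + amortized_overhead_ms(round_trip_ms, frames_per_query)
```

Under these assumptions a 200 ms round trip reused for 100 frames adds 2 ms per frame, while querying on every frame would add the full 200 ms. This is why the query frequency and connectivity assumptions matter for the latency claim.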
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment point by point below, clarifying our experimental setup where possible and committing to revisions that strengthen the presentation of results and the justification for our design choices.
-
Referee: [Abstract and §4] Abstract and §4 (Experiments): The stated improvements (7.9% acc@1 on IN1K, 5.2% mIoU on ADE20K, 77.9% FLOPs reduction) are presented without any description of baselines, model sizes compared, number of runs, variance, or statistical tests. This directly undermines evaluation of the central accuracy-efficiency claim.
Authors: We agree that the abstract and §4 would benefit from greater explicitness to allow full evaluation of the claims. The manuscript already compares against fixed-size VFM baselines (e.g., CLIP-ViT variants, BLIP, and prior NAS methods) of comparable parameter counts and FLOPs, with the 7.9% and 5.2% figures representing the maximum observed gains over the strongest such baseline at each operating point. In the revised version we will (i) expand the abstract and §4 to list the exact baseline models and their sizes, (ii) report results averaged over 3–5 runs with standard deviations, and (iii) add paired statistical significance tests for the key accuracy and FLOPs differences. These additions will be placed in a new “Evaluation Protocol” subsection of §4. revision: yes
-
Referee: [§3] §3 (Method): The motivating assumption that 'the effect of model size reduction on performance is task-dependent' is used to justify the entire adaptive NAS+LLM design, yet no controlled ablation or analysis is referenced showing how performance degradation varies across tasks/scenes to support the runtime selection policy.
Authors: The core insight is indeed that performance sensitivity to model size varies with scene complexity and task type; this is what enables the LLM agent to select subnets profitably at runtime. While §4 already shows that AdaVFM outperforms fixed-size models on two distinct tasks (zero-shot classification and open-vocabulary segmentation), we acknowledge that a more targeted, controlled demonstration of the variation itself would strengthen the motivation. In the revised manuscript we will add a dedicated ablation (new Figure or subsection in §3 or §4) that measures accuracy degradation for the same set of subnets across controlled subsets of ImageNet and ADE20K stratified by scene complexity (e.g., object density, lighting variation). This will directly illustrate the task/scene dependence that justifies the adaptive policy. revision: yes
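The evaluation protocol promised in the first response (multiple runs, standard deviations, paired tests) can be sketched with the Python standard library. The rebuttal does not say which paired test will be used; a paired t statistic is shown here as one common choice, and the function names are hypothetical.

```python
import math
import statistics


def mean_std(values):
    """Mean and sample standard deviation over repeated runs."""
    return statistics.mean(values), statistics.stdev(values)


def paired_t_statistic(a, b):
    """Paired t statistic for matched runs a[i] and b[i].

    Runs are assumed to be matched, for example by sharing a seed or a
    data split. The statistic has len(a) - 1 degrees of freedom.
    """
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 runs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = statistics.stdev(diffs)
    if sd == 0:
        raise ValueError("differences have zero variance")
    return statistics.mean(diffs) / (sd / math.sqrt(len(diffs)))
```

With only 3 to 5 runs the test has few degrees of freedom, so the resulting statistic should be read against the appropriate t distribution rather than a normal approximation.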
Circularity Check
No circularity in derivation chain
The paper describes an engineering framework for adaptive VFM inference using NAS and cloud LLM control, motivated by the empirical observation that model size reduction effects are task-dependent. No equations, first-principles derivations, or predictions are presented that reduce to inputs by construction. All performance claims rest on external benchmark comparisons (IN1K, ADE20K) against prior baselines, with no self-citation load-bearing steps or fitted parameters renamed as predictions. The derivation chain is self-contained against external experimental evidence.
Reference graph
Works this paper leans on
- [1] Meta ray-ban smart glasses. https://www.meta.com/ai-glasses/ray-ban-meta/ (2023), a series of AI-enabled smart glasses combining camera, audio, and voice-controlled Meta AI features, developed by Meta Platforms in partnership with Ray-Ban
- [2] Abbaspourazad, S., Elachqar, O., Miller, A., Emrani, S., Nallasamy, U., Shapiro, I.: Large-scale training of foundation models for wearable biosignals. In: The Twelfth International Conference on Learning Representations (2024)
- [3] Abrash, M.: Creating the future: Augmented reality, the next human-machine interface. In: 2021 IEEE International Electron Devices Meeting (IEDM). pp. 1–11 (2021). https://doi.org/10.1109/IEDM19574.2021.9720526
- [4] Alayrac, J.B., Donahue, J., Luc, P., Miech, A., Barr, I., Hasson, Y., Lenc, K., Mensch, A., Millicah, K., Reynolds, M., Ring, R., Rutherford, E., Cabi, S., Han, T., Gong, Z., Samangooei, S., Monteiro, M., Menick, J., Borgeaud, S., Brock, A., Nematzadeh, A., Sharifzadeh, S., Binkowski, M., Barreira, R., Vinyals, O., Zisserman, A., Simonyan, K.: Flamingo:... In: Proceedings of the 36th International Conference on Neural Information Processing Systems
- [5] Arm®: Arm ethos-u55 micronpu description. https://www.arm.com/products/silicon-ip-cpu/ethos/ethos-u55 (Accessed 2026-03)
- [6] Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: European Conference on Computer Vision (2014)
- [7] Brun, D.: Multimodal and context-aware interaction in augmented reality for active assistance. In: Proceedings of the 20th ACM International Conference on Multimodal Interaction. pp. 506–510 (2018)
- [8] Cai, H., Gan, C., Wang, T., Zhang, Z., Han, S.: Once-for-all: Train one network and specialize it for efficient deployment. In: International Conference on Learning Representations (2020), https://openreview.net/forum?id=HylxE1HKwS
- [9] Chen, J., Hu, J., Wang, G., Jiang, Z., Zhou, T., Chen, Z., Lv, C.: Taoavatar: Real-time lifelike full-body talking avatars for augmented reality via 3d gaussian splatting. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 10723–10734 (June 2025)
- [10] Chen, X., Wang, X., Changpinyo, S., Piergiovanni, A., Padlewski, P., Salz, D., Goodman, S., Grycner, A., Mustafa, B., Beyer, L., et al.: Pali: A jointly-scaled multilingual language-image model. In: The Eleventh International Conference on Learning Representations (2023)
- [11] Chu, X., Zhang, B., Xu, R.: Fairnas: Rethinking evaluation fairness of weight sharing neural architecture search. In: Proceedings of the IEEE/CVF International Conference on computer vision. pp. 12239–12248 (2021)
16 Y. Zhao et al
- [12] Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: Proceedings of the IEEE Conf. on Computer Vision and Pattern Recognition (CVPR) (2014)
- [13] Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: 2009 IEEE Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
- [14] Driess, D., Xia, F., Sajjadi, M.S.M., Lynch, C., Chowdhery, A., Ichter, B., Wahid, A., Tompson, J., Vuong, Q., Yu, T., Huang, W., Chebotar, Y., Sermanet, P., Duckworth, D., Levine, S., Vanhoucke, V., Hausman, K., Toussaint, M., Greff, K., Zeng, A., Mordatch, I., Florence, P.: Palm-e: an embodied multimodal language model. In: Proceedings of the 40th International Conference on Machine Learning...
- [15] Fan, Y., Ma, X., Wu, R., Du, Y., Li, J., Gao, Z., Li, Q.: Videoagent: A memory-augmented multimodal agent for video understanding. In: Leonardis, A., Ricci, E., Roth, S., Russakovsky, O., Sattler, T., Varol, G. (eds.) Computer Vision – ECCV. pp. 75–92. Springer Nature Switzerland, Cham (2025)
- [17] Fang, A., Jose, A.M., Jain, A., Schmidt, L., Toshev, A.T., Shankar, V.: Data filtering networks. In: The Twelfth International Conference on Learning Representations (2024), https://openreview.net/forum?id=KAk6ngZ09F
- [18] Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories. Computer vision and Image understanding 106(1), 59–70 (2007)
- [19] Fellbaum, C.: WordNet: An electronic lexical database. MIT press (1998)
- [20] Fundamental AI Research, M.: Introducing llama 4: Advancing multimodal intelligence. https://ai.meta.com/blog/llama-4-multimodal-intelligence/ (2024), accessed April 5, 2025
- [21] Gebru, T., Krause, J., Wang, Y., Chen, D., Deng, J., Fei-Fei, L.: Fine-grained car detection for visual census estimation. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 31 (2017)
- [22] Georg, M., Tanzer, G., Uboweja, E., Hassan, S., Shengelia, M., Sepah, S., Forbes, S., Starner, T.: Fsboard: Over 3 million characters of asl fingerspelling collected via smartphones. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 13897–13906 (June 2025)
- [23] Ghiasi, G., Gu, X., Cui, Y., Lin, T.Y.: Scaling open-vocabulary image segmentation with image-level labels. In: European conference on computer vision. pp. 540–557. Springer (2022)
- [24] Goswami, R.G., Krishnamurthy, P., LeCun, Y., Khorrami, F.: Robopepp: Vision-based robot pose and joint angle estimation through embedding predictive pre-training. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6930–6939 (June 2025)
- [25] He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R.: Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR) (2022)
- [26] Hoang, M.L.: A comprehensive review of machine learning, and deep learning in wearable iot devices. IEEE Access (2025)
- [27] Howard, A., Sandler, M., Chen, B., Wang, W., Chen, L.C., Tan, M., Chu, G., Vasudevan, V., Zhu, Y., Pang, R., Adam, H., Le, Q.: Searching for mobilenetv3. In: 2019 IEEE/CVF International Conference on Computer Vision (ICCV). pp. 1314–1324 (2019). https://doi.org/10.1109/ICCV.2019.00140
- [28] Ilharco, G., Wortsman, M., Carlini, N., Taori, R., Dave, A., Shankar, V., Namkoong, H., Miller, J., Hajishirzi, H., Farhadi, A., et al.: Openclip. Zenodo (2021)
- [29] Imam, M.F., Marew, R.F., Hassan, J., Fiaz, M., Aji, A.F., Cholakkal, H.: Clip meets dino for tuning zero-shot classifier using unlabeled image collections. arXiv preprint arXiv:2411.19346 (2024)
- [30] Jose, C., Moutakanni, T., Kang, D., Baldassarre, F., Darcet, T., Xu, H., Li, D., Szafraniec, M., Ramamonjisoa, M., Oquab, M., et al.: Dinov2 meets text: A unified framework for image-and pixel-level vision-language alignment. In: Proceedings of the Computer Vision and Pattern Recognition Conference. pp. 24905–24916 (2025)
- [31] Kavukcuoglu, K., Pichai, S., Hassabis, D., Walker, K., Manyika, J., Porat, R.: Gemini-2.5: Our most intelligent ai model. Google DeepMind Blog (Mar 2025)
- [32] Khattak, M.U., Naeem, M.F., Naseer, M., Van Gool, L., Tombari, F.: Learning to prompt with text only supervision for vision-language models. In: Proceedings of the AAAI Conference on Artificial Intelligence. vol. 39, pp. 4230–4238 (2025)
- [33] Kim, S.Y., Chung, D.o., Lee, K., Lee, C., Huh, J.: Low-power always-on camera (aoc) system with workload offloading to cmos image sensor. In: 2023 IEEE International Conference on Consumer Electronics (ICCE). pp. 1–2. IEEE (2023)
- [34] Krizhevsky, A., Hinton, G.: Learning multiple layers of features from tiny images (2009)
- [35] Lane, N.D., Bhattacharya, S., Georgiev, P., Forlivesi, C., Jiao, L., Qendro, L., Kawsar, F.: Deepx: A software accelerator for low-power deep learning inference on mobile devices. In: 2016 15th ACM/IEEE International Conference on Information Processing in Sensor Networks (IPSN). pp. 1–12 (2016). https://doi.org/10.1109/IPSN.2016.7460664
- [36] Lee, J., Joo, H.: Mocap everyone everywhere: Lightweight motion capture with smartwatches and a head-mounted camera. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 1091–1100 (2024)
- [37] Li, C., Tang, T., Wang, G., Peng, J., Wang, B., Liang, X., Chang, X.: Bossnas: Exploring hybrid cnn-transformers with block-wisely self-supervised neural architecture search. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 12281–12291 (2021)
- [38] Li, E., Zhou, Z., Chen, X.: Edge intelligence: On-demand deep learning model co-inference with device-edge synergy. In: Proceedings of the 2018 Workshop on Mobile Edge Communications. p. 31–36. MECOMM’18, Association for Computing Machinery, New York, NY, USA (2018). https://doi.org/10.1145/3229556.3229562
- [39] Li, J.N., Zhang, Z.J., Ma, J.: Omniquery: Contextually augmenting captured multimodal memories to enable personal question answering. In: Proceedings of the 2025 CHI Conference on Human Factors in Computing Systems. CHI ’25, Association for Computing Machinery, New York, NY, USA (2025). https://doi.org/10.1145/3706598.3713448, https://doi.org/10.1145/3...
- [40] Li, J., Li, D., Savarese, S., Hoi, S.: Blip-2: bootstrapping language-image pre-training with frozen image encoders and large language models. In: Proceedings of the 40th International Conference on Machine Learning. ICML’23, JMLR.org (2023)
- [41] Lin, R., Weng, P., Wang, Y., Ding, H., Han, J., Wang, F.: Hilots: High-low temporal sensitive representation learning for semi-supervised lidar segmentation in autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1429–1438 (June 2025)
- [42] Liu, Z., Mao, H., Wu, C.Y., Feichtenhofer, C., Darrell, T., Xie, S.: A convnet for the 2020s. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 11976–11986 (2022)
- [43] Lu, Z., Sreekumar, G., Goodman, E., Banzhaf, W., Deb, K., Boddeti, V.N.: Neural architecture transfer. IEEE Transactions on Pattern Analysis and Machine Intelligence 43(9), 2971–2989 (2021). https://doi.org/10.1109/TPAMI.2021.3052758
- [44] Maji, S., Kannala, J., Rahtu, E., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. Tech. rep. (2013)
- [45] Mehta, S., Rastegari, M.: Mobilevit: Light-weight, general-purpose, and mobile-friendly vision transformer. In: International Conference on Learning Representations (2022)
- [46] Mirza, M.J., Karlinsky, L., Lin, W., Possegger, H., Kozinski, M., Feris, R., Bischof, H.: Lafter: Label-free tuning of zero-shot classifier using language and unlabeled image collections. Advances in Neural Information Processing Systems 36, 5765–5777 (2023)
- [47] Moon, G., Weipeng, X., Joshi, R., Chenglei, W., Shiratori, T.: Authentic hand avatar from a phone scan via universal hand model. In: 2024 IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 2029–2038 (2024). https://doi.org/10.1109/CVPR52733.2024.00198
- [48] Nazhamaiti, M., Xu, H., Liu, Z., Chen, Y., Wei, Q., Wu, X., Qiao, F.: Ns-md: near-sensor motion detection with energy harvesting image sensor for always-on visual perception. IEEE Transactions on Circuits and Systems II: Express Briefs 68(9), 3078–3082 (2021)
- [49] OpenAI: Introducing gpt-5. https://openai.com/index/introducing-gpt-5/ (August 2025), large language model
- [50] Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., et al.: Dinov2: Learning robust visual features without supervision. Transactions on Machine Learning Research Journal pp. 1–31 (2024)
- [51] Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: 2012 IEEE conference on computer vision and pattern recognition. pp. 3498–3505. IEEE (2012)
- [52] Raaen, K., Kjellmo, I.: Measuring latency in virtual reality systems. In: Entertainment Computing - ICEC 2015. p. 457–462. Springer-Verlag, Berlin, Heidelberg (2022). https://doi.org/10.1007/978-3-319-24589-8_40
- [53] Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
- [54] Real, E., Moore, S., Selle, A., Saxena, S., Suematsu, Y.L., Tan, J., Le, Q.V., Kurakin, A.: Large-scale evolution of image classifiers. In: International conference on machine learning. pp. 2902–2911. PMLR (2017)
- [55] Roth, K., Kim, J.M., Koepke, A., Vinyals, O., Schmid, C., Akata, Z.: Waffling around for performance: Visual classification with random words and broad concepts. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 15746–15757 (2023)
- [56] Seneviratne, S., Hu, Y., Nguyen, T., Lan, G., Khalifa, S., Thilakarathna, K., Hassan, M., Seneviratne, A.: A survey of wearable devices and challenges. IEEE Communications Surveys & Tutorials 19(4), 2573–2620 (2017)
- [57] Serianni, A., Kalita, J.: Training-free neural architecture search for RNNs and transformers. In: Rogers, A., Boyd-Graber, J., Okazaki, N. (eds.) Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers). pp. 2522–2540. Association for Computational Linguistics, Toronto, Canada (Jul 2023). https://doi.org/10.18653...
- [58] Shrivastava, A., Selvaraju, R.R., Naik, N., Ordonez, V.: Clip-lite: Information efficient visual representation learning with language supervision. In: International Conference on Artificial Intelligence and Statistics. pp. 8433–8447. PMLR (2023)
- [59] Skillman, A., Edsö, T.: A technical overview of cortex-m55 and ethos-u55: Arm’s most capable processors for endpoint ai. In: 2020 IEEE Hot Chips 32 Symposium (HCS). pp. 1–20 (2020). https://doi.org/10.1109/HCS49909.2020.9220415
- [60] Stauffert, J.P., Niebling, F., Latoschik, M.E.: Reducing application-stage latencies for real-time interactive systems. In: 2016 IEEE 9th Workshop on Software Engineering and Architectures for Realtime Interactive Systems (SEARIS). pp. 1–7. IEEE (2016)
- [61] Tan, M., Le, Q.: Efficientnet: Rethinking model scaling for convolutional neural networks. In: International conference on machine learning. pp. 6105–6114. PMLR (2019)
- [62] Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al.: Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288 (2023)
- [63] Woo, S., Debnath, S., Hu, R., Chen, X., Liu, Z., Kweon, I.S., Xie, S.: Convnext v2: Co-designing and scaling convnets with masked autoencoders. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 16133–16142 (2023)
- [64] Wu, K., Peng, H., Zhou, Z., Xiao, B., Liu, M., Yuan, L., Xuan, H., Valenzuela, M., Chen, X.S., Wang, X., et al.: Tinyclip: Clip distillation via affinity mimicking and weight inheritance. In: Proceedings of the IEEE/CVF International Conference on Computer Vision. pp. 21970–21980 (2023)
- [65] Xia, W., Feng, R., Wang, D., Hu, D.: Phoenix: A motion-based self-reflection framework for fine-grained robotic action correction. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 6981–6990 (June 2025)
- [66] Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: Sun database: Large-scale scene recognition from abbey to zoo. In: 2010 IEEE computer society conference on computer vision and pattern recognition. pp. 3485–3492. IEEE (2010)
- [67] Xing, Y., Kang, J., Xiao, A., Nie, J., Shao, L., Lu, S.: Rewrite caption semantics: Bridging semantic gaps for language-supervised semantic segmentation. Advances in Neural Information Processing Systems 36, 68798–68809 (2023)
- [68] Xing, Z., Zhang, X., Hu, Y., Jiang, B., He, T., Zhang, Q., Long, X., Yin, W.: Goalflow: Goal-driven flow matching for multimodal trajectories generation in end-to-end autonomous driving. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR). pp. 1602–1611 (June 2025)
- [69] Xu, H., Xie, S., Tan, X., Huang, P.Y., Howes, R., Sharma, V., Li, S.W., Ghosh, G., Zettlemoyer, L., Feichtenhofer, C.: Demystifying CLIP data. In: The Twelfth International Conference on Learning Representations (2024)
- [70] Xu, J., Hou, J., Zhang, Y., Feng, R., Wang, Y., Qiao, Y., Xie, W.: Learning open-vocabulary semantic segmentation models from natural language supervision. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 2935–2944 (2023)
- [71] Yu, J., Huang, T.S.: Universally slimmable networks and improved training techniques. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 1803–1811 (2019)
- [72] Yu, J., Jin, P., Liu, H., Bender, G., Kindermans, P.J., Tan, M., Huang, T., Song, X., Pang, R., Le, Q.: Bignas: Scaling up neural architecture search with big single-stage models. In: Computer Vision–ECCV 2020: 16th European Conference, Part VII 16. pp. 702–717. Springer (2020)
- [73] Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre-training. In: Proceedings of the IEEE/CVF international conference on computer vision. pp. 11975–11986 (2023)
- [74] Zhai, X., Wang, X., Mustafa, B., Steiner, A., Keysers, D., Kolesnikov, A., Beyer, L.: Lit: Zero-shot transfer with locked-image text tuning. In: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. pp. 18123–18133 (2022)
- [75] Zhang, H., Zhang, P., Hu, X., Chen, Y.C., Li, L., Dai, X., Wang, L., Yuan, L., Hwang, J.N., Gao, J.: Glipv2: Unifying localization and vision-language understanding. Advances in Neural Information Processing Systems 35, 36067–36080 (2022)
- [76] Zhao, Y., Chen, J., Zhang, S.Q., Sarwar, S.S., Stangherlin, K.H., Gomez, J.T., Seo, J.S., De Salvo, B., Liu, C., Gibbons, P.B., Li, Z.: H4h: Hybrid convolution-transformer architecture search for npu-cim heterogeneous systems for ar/vr applications. In: Proceedings of the 30th Asia and South Pacific Design Automation Conference. p. 1133–1141. ASPDAC ’2...
- [77] Zhou, B., Zhao, H., Puig, X., Fidler, S., Barriuso, A., Torralba, A.: Scene parsing through ade20k dataset. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2017)
- [78] Zhou, B., Zhao, H., Puig, X., Xiao, T., Fidler, S., Barriuso, A., Torralba, A.: Semantic understanding of scenes through the ade20k dataset. International Journal of Computer Vision 127(3), 302–321 (2019)
- [79] Zhou, J., Yu, X., Luo, P., et al.: ibot: Image bert pre-training with online tokenizer. In: International Conference on Learning Representations (ICLR) (2022)
- [80] Zoph, B., Le, Q.: Neural architecture search with reinforcement learning. In: International Conference on Learning Representations (2016)
AdaVFM: Supplementary Material
A Hardware Platform and Evaluation Setup
We adopt the ARM Ethos-U55 [5,58] as a representative edge Neural Processing Unit (NPU). The test silicon (Fig. 1) is fabricated in 7nm FinFET an...