pith. machine review for the scientific record.

arxiv: 2605.05910 · v1 · submitted 2026-05-07 · 💻 cs.CV

Recognition: unknown

Plug-and-play Class-aware Knowledge Injection for Prompt Learning with Visual-Language Model

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 15:47 UTC · model grok-4.3

classification 💻 cs.CV
keywords prompt learning · class-aware knowledge injection · vision-language models · few-shot learning · zero-shot classification · CLIP · knowledge bank · plug-and-play

The pith

CAKI supplements class-specific knowledge from few-shot samples into prompt learning to improve vision-language model accuracy on base and novel classes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that prompt learning in vision-language models like CLIP overlooks valuable class-specific knowledge, which leads to suboptimal results in downstream classification. Class-specific prompts deliver finer supervision than class-shared prompts and richer information than instance-specific ones, thereby reducing both inter-class and intra-class misclassifications. The proposed CAKI framework generates class-specific prompts from few-shot samples of each class, stores them in a knowledge bank, and applies query-key matching to inject relevant knowledge into predictions for new test instances. This plug-and-play design integrates with existing methods and yields measurable gains on both base and novel classes. A sympathetic reader would care because the approach offers a lightweight way to enhance zero-shot and few-shot performance in domain-specific tasks without retraining the base model.
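
To make the mechanism concrete, here is a minimal sketch of the bank-construction step in PyTorch-style Python. The choice of the mean image embedding of a class's few shots as the key, and a freshly initialized learnable prompt tensor as the value, are illustrative assumptions rather than the authors' exact design; the prompt-training loop is omitted.

    # Hedged sketch of knowledge-bank construction; the key/value choices are
    # assumptions for illustration, not the paper's exact design.
    import torch
    import torch.nn.functional as F

    def build_knowledge_bank(image_encoder, few_shot_images, prompt_len=4, dim=512):
        """few_shot_images: dict of class index -> (K, 3, H, W) tensor of K shots."""
        keys, values = [], []
        for cls in sorted(few_shot_images):
            with torch.no_grad():
                feats = image_encoder(few_shot_images[cls])        # (K, d) embeddings
            keys.append(F.normalize(feats.mean(dim=0), dim=-1))    # class-level key
            # One class-specific prompt per class; in CAKI this would be learned
            # against the class's own few shots (training loop omitted here).
            values.append(torch.randn(prompt_len, dim, requires_grad=True))
        return torch.stack(keys), values                           # keys: (C, d)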

Core claim

The authors propose the Class-Aware Knowledge Injection (CAKI) framework. Its first component, class-specific prompt generation, encodes knowledge from same-class few-shot samples into prompts stored in a class-level knowledge bank; its second, query-key prompt matching, lets each test instance retrieve the matching class knowledge and inject it to refine model predictions. The result, the authors claim, is improved performance on both base and novel classes.

What carries the argument

The CAKI framework's query-key prompt matching mechanism, which retrieves relevant entries from a class-level knowledge bank built via class-specific prompt generation from few-shot samples.
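
A companion sketch of that matching step, continuing the assumptions above: cosine similarity between the test-image embedding and the bank keys selects the prompt to inject, and the hypothetical text_encoder(classnames, prompt) signature stands in for however the paper conditions text features on the retrieved prompt.

    # Hedged sketch of query-key prompt matching; retrieval by cosine similarity
    # and the text_encoder(classnames, prompt) signature are illustrative
    # assumptions, not the authors' API.
    import torch.nn.functional as F

    def retrieve_class_prompt(image_feat, bank_keys, bank_values):
        """image_feat: (d,) test-image embedding; bank_keys: (C, d), L2-normalized."""
        sims = F.normalize(image_feat, dim=-1) @ bank_keys.T   # (C,) match scores
        best = int(sims.argmax())
        return bank_values[best], sims[best]

    def predict_with_injection(image_feat, bank_keys, bank_values,
                               text_encoder, classnames):
        prompt, _ = retrieve_class_prompt(image_feat, bank_keys, bank_values)
        # Inject the retrieved class knowledge by conditioning text features on it.
        text_feats = F.normalize(text_encoder(classnames, prompt), dim=-1)  # (N, d)
        return F.normalize(image_feat, dim=-1) @ text_feats.T  # logits over N classes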

If this is right

  • Existing prompt learning methods achieve higher accuracy on both base and novel classes when CAKI is added.
  • Class-specific knowledge prevents data from different classes from being misclassified into a single class.
  • Class-level information from multiple instances prevents data from the same class from being split across multiple classes.
  • The plug-and-play design allows refinement of predictions in zero-shot domain-specific classification without altering the underlying VLM.
  • The approach works with various existing prompt learning techniques through simple integration.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The knowledge bank construction could be extended to handle open-vocabulary settings with dynamically added classes.
  • Performance may improve further if the matching mechanism is combined with instance-specific prompt generators.
  • The method suggests a general pattern for injecting structured knowledge into other multimodal prompt-based systems.
  • Scalability of the knowledge bank with increasing numbers of classes remains an open question for large-scale deployment.

Load-bearing premise

The assumption that class-specific prompts derived from few-shot samples can be reliably encoded and that the query-key matching will accurately retrieve and inject relevant knowledge without introducing errors that degrade predictions.
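
To make the premise concrete, here is one hedged way a retrieval failure could be contained. The confidence gate below illustrates what the premise must rule out; it is an assumption for exposition, not a mechanism the paper describes.

    # Illustrative confidence gate (not from the paper): a weak query-key match
    # falls back to the unmodified prediction instead of injecting knowledge
    # from the wrong class.
    import torch.nn.functional as F

    def gated_logits(base_logits, image_feat, bank_keys, refine_fn, tau=0.7):
        sims = F.normalize(image_feat, dim=-1) @ bank_keys.T   # (C,) match scores
        conf, idx = sims.max(dim=-1)
        if conf.item() < tau:            # weak match: trust the baseline prediction
            return base_logits
        return refine_fn(base_logits, int(idx))  # refine_fn is a hypothetical injector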

What would settle it

An experiment on a few-shot benchmark such as 16-shot ImageNet where integrating CAKI into a baseline prompt learning method produces no accuracy gain or a drop on novel classes compared to the unmodified baseline.
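
In code, the settling experiment is a plain ablation on the novel split. eval_accuracy and the model handles below are hypothetical placeholders for a real base-to-novel harness, not an existing API.

    # Placeholder ablation for the settling experiment; eval_accuracy and both
    # model handles are hypothetical stand-ins for a real base-to-novel harness.
    def novel_class_ablation(baseline_model, caki_model, novel_loader, eval_accuracy):
        acc_plain = eval_accuracy(baseline_model, novel_loader)
        acc_caki = eval_accuracy(caki_model, novel_loader)
        # A zero or negative delta on novel classes would count against the
        # injection mechanism; a positive delta supports it.
        return acc_plain, acc_caki, acc_caki - acc_plain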

Original abstract

Prompt learning has become an effective and widely used technique in enhancing vision-language models (VLMs) such as CLIP for various downstream tasks, particularly in zero-shot classification within specific domains. Existing methods typically focus on either learning class-shared prompts for a given domain or generating instance-specific prompts through conditional prompt learning. While these methods have achieved promising performance, they often overlook class-specific knowledge in prompt design, leading to suboptimal outcomes. The underlying reasons are: 1) class-specific prompts offer more fine-grained supervision compared to coarse class-shared prompts, which helps prevent misclassification of data from different classes into a single class; 2) compared to class-specific prompts, instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing data from the same class to be divided into multiple classes. To effectively supplement the class-specific knowledge into existing methods, we propose a plug-and-play Class-Aware Knowledge Injection (CAKI) framework. CAKI comprises two key components, i.e., class-specific prompt generation and query-key prompt matching. The former encodes class-specific knowledge into prompts from few-shot samples that belong to the same class and stores the learned prompts in a class-level knowledge bank. The latter provides a plug-and-play mechanism for each test instance to retrieve relevant class-level knowledge from the knowledge bank and inject such knowledge to refine model predictions. Extensive experiments demonstrate that our CAKI effectively improves the performance of existing methods on base and novel classes. Code is publicly available at https://github.com/yjh576/CAKI.

Editorial analysis

A structured set of objections, weighed in public.

A referee report, a simulated author's rebuttal, a circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes CAKI, a plug-and-play framework for vision-language models that supplements class-specific knowledge into prompt learning. It consists of (1) generating class-specific prompts from few-shot samples of the same class and storing them in a class-level knowledge bank, and (2) a query-key matching step that retrieves relevant entries from the bank for each test instance and injects the knowledge to refine predictions. The central claim is that this improves performance of existing prompt-learning methods on both base and novel classes.

Significance. If the mechanism is sound, the work addresses a genuine gap between coarse class-shared prompts and overly fine instance-specific prompts by adding class-level supervision. The public release of code at the cited GitHub repository is a clear strength that enables direct reproduction and extension.

major comments (1)
  1. [Abstract] Abstract and the description of class-specific prompt generation: the claim that CAKI improves novel-class performance 'by supplementing class-specific knowledge' is not supported by the stated mechanism. The knowledge bank is populated exclusively from few-shot samples belonging to the same class; under the standard base-to-novel protocol, novel classes have zero shots, so the bank contains only base-class entries. The query-key matching step therefore cannot retrieve class-specific knowledge for a novel-class query. No additional generation step, similarity-threshold rule, or fallback that would supply novel-class-specific knowledge is described. This directly undermines attribution of any observed novel-class gains to the proposed injection mechanism. (A minimal sketch after this report makes the gap concrete.)
minor comments (1)
  1. [Abstract] The abstract states that 'extensive experiments demonstrate' gains but provides no quantitative numbers, baselines, or dataset details; these belong in the main text or a results table.
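
The major objection can be stated in the terms of the sketches above, under the same illustrative assumptions: with keys only for base classes, argmax retrieval maps every novel-class query onto some base class, so any injected "class-specific" knowledge is off-class by construction.

    # The referee's gap, in the sketch's own terms: base_bank_keys has one row
    # per *base* class, so retrieval for a novel-class query necessarily returns
    # a base-class index; absent a fallback rule, the injected knowledge is
    # off-class for every novel-class instance.
    import torch.nn.functional as F

    def retrieval_for_novel_query(novel_image_feat, base_bank_keys):
        sims = F.normalize(novel_image_feat, dim=-1) @ base_bank_keys.T  # (C_base,)
        return int(sims.argmax())  # always a base-class index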

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The major comment identifies a key issue with how the abstract attributes novel-class gains to the class-specific injection mechanism. We address this point directly below and will revise the manuscript to align claims with the described method.

Point-by-point responses
  1. Referee: [Abstract] Abstract and the description of class-specific prompt generation: the claim that CAKI improves novel-class performance 'by supplementing class-specific knowledge' is not supported by the stated mechanism. The knowledge bank is populated exclusively from few-shot samples belonging to the same class; under the standard base-to-novel protocol, novel classes have zero shots, so the bank contains only base-class entries. The query-key matching step therefore cannot retrieve class-specific knowledge for a novel-class query. No additional generation step, similarity-threshold rule, or fallback that would supply novel-class-specific knowledge is described. This directly undermines attribution of any observed novel-class gains to the proposed injection mechanism.

    Authors: We appreciate the referee highlighting this important clarification. Upon re-examination, we agree that the class-specific prompt generation step relies exclusively on few-shot samples of the same class, so under the standard base-to-novel protocol the knowledge bank contains entries only for base classes. For novel-class test instances the query-key matching therefore has no relevant class-specific entries to retrieve or inject. The empirical improvements on novel classes reported in the experiments are thus not directly attributable to class-specific knowledge supplementation for those classes; they likely arise indirectly from the plug-and-play enhancement of the underlying prompt-learning methods on base classes, which improves overall VLM generalization. We will revise the abstract, method description, and discussion sections to remove the direct attribution of novel-class gains to the injection mechanism, explicitly note the limitation for novel classes, and clarify what the framework achieves. These changes will ensure all claims are precisely supported by the proposed components. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the proposed CAKI framework

Full rationale

The manuscript presents an empirical plug-and-play framework (class-specific prompt generation from few-shot samples into a knowledge bank, followed by query-key matching for injection) that augments existing prompt-learning methods for VLMs. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the described structure or abstract. The approach takes pre-trained VLMs and external few-shot data as independent inputs and reports experimental gains on base/novel classes; the central claims therefore remain self-contained against external benchmarks rather than reducing to internal definitions or self-referential fits.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on domain assumptions about the relative value of different prompt types and the effectiveness of the matching mechanism; no free parameters or invented entities are described in the abstract.

axioms (2)
  • domain assumption: Class-specific prompts offer more fine-grained supervision than coarse class-shared prompts, helping prevent misclassification across classes.
    Explicitly listed as underlying reason 1 in the abstract.
  • domain assumption: Instance-specific prompts neglect the richer class-level information across multiple instances, potentially causing same-class data to be split.
    Explicitly listed as underlying reason 2 in the abstract.

pith-pipeline@v0.9.0 · 5599 in / 1317 out tokens · 35530 ms · 2026-05-09T15:47:59.321192+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

42 extracted references · 9 canonical work pages · 3 internal anchors

  1. Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: Int. Conf. Mach. Learn., pp. 8748–8763 (2021). PMLR
  2. Jia, C., Yang, Y., Xia, Y., Chen, Y.-T., Parekh, Z., Pham, H., Le, Q., Sung, Y.-H., Li, Z., Duerig, T.: Scaling up visual and vision-language representation learning with noisy text supervision. In: Int. Conf. Mach. Learn., pp. 4904–4916 (2021). PMLR
  3. Lee, J., Kim, J., Shon, H., Kim, B., Kim, S.H., Lee, H., Kim, J.: UniCLIP: Unified framework for contrastive language-image pre-training. Adv. Neural Inform. Process. Syst. 35, 1008–1019 (2022)
  4. He, J., Zhou, C., Ma, X., Berg-Kirkpatrick, T., Neubig, G.: Towards a unified view of parameter-efficient transfer learning. arXiv preprint arXiv:2110.04366 (2021)
  5. Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y.: CLIP-Adapter: Better vision-language models with feature adapters. Int. J. Comput. Vis. (2024)
  6. Gabeff, V., Rußwurm, M., Tuia, D., Mathis, A.: WildCLIP: Scene and animal attribute retrieval from camera trap data with domain-adapted vision-language models. Int. J. Comput. Vis. 132(9), 3770–3786 (2024)
  7. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Learning to prompt for vision-language models. Int. J. Comput. Vis. (2022)
  8. Zhou, K., Yang, J., Loy, C.C., Liu, Z.: Conditional prompt learning for vision-language models. In: IEEE Conf. Comput. Vis. Pattern Recog. (2022)
  9. Li, X.L., Liang, P.: Prefix-tuning: Optimizing continuous prompts for generation. In: ACL (2021)
  10. Lester, B., Al-Rfou, R., Constant, N.: The power of scale for parameter-efficient prompt tuning. In: EMNLP (2021)
  11. Bahng, H., Jahanian, A., Sankaranarayanan, S., Isola, P.: Exploring visual prompts for adapting large-scale models. arXiv preprint arXiv:2203.17274 (2022)
  12. Jia, M., Tang, L., Chen, B.-C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.-N.: Visual prompt tuning. In: Eur. Conf. Comput. Vis., pp. 709–727 (2022). Springer
  13. Zang, Y., Li, W., Zhou, K., Huang, C., Loy, C.C.: Unified vision and language prompt learning. arXiv preprint arXiv:2210.07225 (2022)
  14. Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S.: MaPLe: Multi-modal prompt learning. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 19113–19122 (2023)
  15. Khattak, M.U., Wasim, S.T., Naseer, M., Khan, S., Yang, M.-H., Khan, F.S.: Self-regulating prompts: Foundational model adaptation without forgetting. In: Int. Conf. Comput. Vis., pp. 15190–15200 (2023)
  16. Li, Z., Li, X., Fu, X., Zhang, X., Wang, W., Chen, S., Yang, J.: PromptKD: Unsupervised prompt distillation for vision-language models. In: IEEE Conf. Comput. Vis. Pattern Recog. (2024)
  17. Chen, G., Yao, W., Song, X., Li, X., Rao, Y., Zhang, K.: PLOT: Prompt learning with optimal transport for vision-language models. arXiv preprint arXiv:2210.01253 (2022)
  18. Miyai, A., Yu, Q., Irie, G., Aizawa, K.: LoCoOp: Few-shot out-of-distribution detection via prompt learning. Adv. Neural Inform. Process. Syst. 36, 76298–76310 (2023)
  19. Lafon, M., Ramzi, E., Rambour, C., Audebert, N., Thome, N.: GalLoP: Learning global and local prompts for vision-language models. In: Eur. Conf. Comput. Vis., pp. 264–282 (2024). Springer
  20. Liang, J., He, R., Tan, T.: A comprehensive survey on test-time adaptation under distribution shifts. Int. J. Comput. Vis. 133(1), 31–64 (2025)
  21. Shu, M., Nie, W., Huang, D.-A., Yu, Z., Goldstein, T., Anandkumar, A., Xiao, C.: Test-time prompt tuning for zero-shot generalization in vision-language models. Adv. Neural Inform. Process. Syst. 35, 14274–14289 (2022)
  22. Abdul Samadh, J., Gani, M.H., Hussein, N., Khattak, M.U., Naseer, M.M., Shahbaz Khan, F., Khan, S.H.: Align your prompts: Test-time prompting with distribution alignment for zero-shot generalization. Adv. Neural Inform. Process. Syst. 36 (2024)
  23. Feng, C.-M., Yu, K., Liu, Y., Khan, S., Zuo, W.: Diverse data augmentation with diffusions for effective test-time prompt tuning. In: Int. Conf. Comput. Vis. (2023)
  24. Grave, E., Cisse, M.M., Joulin, A.: Unbounded cache model for online language modeling with open vocabulary. Adv. Neural Inform. Process. Syst. 30 (2017)
  25. Merity, S., Xiong, C., Bradbury, J., Socher, R.: Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843 (2016)
  26. Zhang, R., Fang, R., Zhang, W., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H.: Tip-Adapter: Training-free CLIP-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930 (2021)
  27. Karmanov, A., Guan, D., Lu, S., El Saddik, A., Xing, E.: Efficient test-time adaptation of vision-language models. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 14162–14171 (2024)
  28. Khandelwal, A.: PromptSync: Bridging domain gaps in vision-language models through class-aware prototype alignment and discrimination. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 7819–7828 (2024)
  29. Yao, H., Zhang, R., Xu, C.: TCP: Textual-based class-aware prompt tuning for visual-language model. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 23438–23448 (2024)
  30. Liu, P., Yuan, W., Fu, J., Jiang, Z., Hayashi, H., Neubig, G.: Pre-train, prompt, and predict: A systematic survey of prompting methods in natural language processing. ACM Computing Surveys 55(9), 1–35 (2023)
  31. Wang, Z., Zhang, Z., Lee, C.-Y., Zhang, H., Sun, R., Ren, X., Su, G., Perot, V., Dy, J., Pfister, T.: Learning to prompt for continual learning. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 139–149 (2022)
  32. Fei-Fei, L., Fergus, R., Perona, P.: Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 178–178 (2004). IEEE
  33. Parkhi, O.M., Vedaldi, A., Zisserman, A., Jawahar, C.: Cats and dogs. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 3498–3505 (2012). IEEE
  34. Krause, J., Stark, M., Deng, J., Fei-Fei, L.: 3D object representations for fine-grained categorization. In: IEEE Conf. Comput. Vis. Pattern Recog. Worksh., pp. 554–561 (2013)
  35. Nilsback, M.-E., Zisserman, A.: Automated flower classification over a large number of classes. In: 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pp. 722–729 (2008). IEEE
  36. Bossard, L., Guillaumin, M., Van Gool, L.: Food-101 – mining discriminative components with random forests. In: Eur. Conf. Comput. Vis., pp. 446–461 (2014). Springer
  37. Maji, S., Rahtu, E., Kannala, J., Blaschko, M., Vedaldi, A.: Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151 (2013)
  38. Helber, P., Bischke, B., Dengel, A., Borth, D.: EuroSAT: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12(7), 2217–2226 (2019)
  39. Xiao, J., Hays, J., Ehinger, K.A., Oliva, A., Torralba, A.: SUN database: Large-scale scene recognition from abbey to zoo. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 3485–3492 (2010). IEEE
  40. Cimpoi, M., Maji, S., Kokkinos, I., Mohamed, S., Vedaldi, A.: Describing textures in the wild. In: IEEE Conf. Comput. Vis. Pattern Recog., pp. 3606–3613 (2014)
  41. Soomro, K., Zamir, A.R., Shah, M.: UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402 (2012)
  42. Chi, Z., Gu, L., Liu, H., Wang, Z., Wu, Y., Wang, Y., Plataniotis, K.N.: Learning to adapt frozen CLIP for few-shot test-time domain adaptation. arXiv preprint arXiv:2506.17307 (2025)