pith. machine review for the scientific record.

arxiv: 2605.11659 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords cross-domain few-shot learning · CLIP fine-tuning · attention rectification · adapter methods · prompt tuning · modality alignment · semantic probe · vision-language models

The pith

Rectifying collapsed attention in CLIP makes prompt-based fine-tuning competitive again for cross-domain few-shot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes baselines showing adapter methods like LoRA outperform prompt methods like MaPLe in cross-domain few-shot learning on CLIP, reversing the pattern seen in standard in-domain tasks. Analysis reveals LoRA succeeds by correcting the collapsed attention of the visual CLS token, which improves how visual features align with text and separate classes by focusing on relevant regions. The authors introduce Semantic Probe, a plug-and-play framework that applies similar attention fixes to both adapter and prompt approaches. Experiments across four CDFSL benchmarks confirm the approach reaches state-of-the-art results while benefiting both fine-tuning styles. The work highlights that attention patterns and modality alignment constraints are key to adapting pretrained vision-language models under limited target samples.

Core claim

LoRA's superiority in CDFSL stems from rectifying the collapsed attention of the visual CLS token, which enhances modality alignment and class separation by directing focus to text-related visual regions. The textual EOS token attends to visual samples more strongly than the visual CLS token does, while CLIP's standard contrastive loss provides only weak constraints on alignment. Semantic Probe is introduced as a general attention rectification framework that plugs into both adapter- and prompt-based methods to restore these benefits, delivering state-of-the-art performance on four CDFSL benchmarks.

What carries the argument

Semantic Probe, a plug-and-play attention rectification framework that adjusts the attention of visual CLS tokens in fine-tuning methods to restore focus on text-related regions and strengthen modality alignment.
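The page reproduces no equations for this module (the EAR block of Figure 7), so the following is a minimal sketch, assuming the rectification is a convex blend of the visual CLS attention with a text-derived attention map over the same patches. `alpha` stands in for the weighting coefficient 𝛼 swept in Figure 13; the function names and the similarity-softmax construction of the EOS attention are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def eos_attention_over_patches(patch_feats, eos_feat, temperature=0.07):
    """One plausible way to obtain 'EOS attention' over image patches:
    a softmax over cosine similarities between the textual EOS embedding
    and per-patch visual embeddings, both projected to the shared space.
    (Assumed construction; the paper may define this differently.)"""
    patch_feats = F.normalize(patch_feats, dim=-1)  # (batch, P, d)
    eos_feat = F.normalize(eos_feat, dim=-1)        # (batch, d)
    sims = torch.einsum("bpd,bd->bp", patch_feats, eos_feat)
    return F.softmax(sims / temperature, dim=-1)    # (batch, P)

def eos_guided_cls_rectification(cls_attn, eos_attn, alpha=0.5):
    """Blend the visual CLS token's attention over patches with the
    text-informed pattern, then renormalize to a valid distribution."""
    rectified = (1.0 - alpha) * cls_attn + alpha * eos_attn
    return rectified / rectified.sum(dim=-1, keepdim=True)
```

Because the blend only edits attention weights in the final layers, it can in principle be attached to a LoRA-tuned or a prompt-tuned CLIP alike, which is what "plug-and-play" would require.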

If this is right

  • Both adapter and prompt fine-tuning methods become viable for CDFSL once attention collapse is addressed.
  • Focusing on text-related visual regions improves class separation in low-data cross-domain settings.
  • Textual EOS tokens can serve as a stronger anchor for visual alignment than CLS tokens alone.
  • Standard contrastive loss in CLIP needs supplementation to better enforce modality alignment across domains (the two directional loss terms are sketched after this list).
  • The same rectification principle scales to multiple fine-tuning paradigms without architecture-specific redesign.
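On the last bullet: CLIP's pretraining objective is a symmetric InfoNCE loss, and Figure 12 isolates the text-to-image term 𝐿𝑡→𝑖 as the one whose introduction improves class aggregation. Below is a minimal sketch of the two directional terms, written so each can be weighted separately; the reweighting is illustrative and is not the paper's Balanced Alignment and Separation loss.

```python
import torch
import torch.nn.functional as F

def directional_infonce(img_emb, txt_emb, temperature=0.07):
    """CLIP-style contrastive loss split into its two directions:
    L_{i->t} (image-to-text) and L_{t->i} (text-to-image)."""
    img = F.normalize(img_emb, dim=-1)      # (N, d)
    txt = F.normalize(txt_emb, dim=-1)      # (N, d)
    logits = img @ txt.t() / temperature    # (N, N) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return loss_i2t, loss_t2i

# Illustrative reweighting (weights are hypothetical, not from the paper):
# l_i2t, l_t2i = directional_infonce(img_emb, txt_emb)
# total = 0.5 * l_i2t + 0.5 * l_t2i  # standard CLIP averages the two
```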

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention analysis could apply to other vision-language models that use CLS tokens and contrastive pretraining.
  • Domains with strong visual-text mismatch, such as medical or satellite imagery, might benefit most from explicit rectification.
  • Combining Semantic Probe with stronger alignment losses could yield further gains beyond current SOTA.
  • Attention collapse may explain why some pretrained models fail to adapt quickly in few-shot regimes.

Load-bearing premise

The superiority of LoRA in cross-domain settings comes specifically from rectifying collapsed visual CLS token attention, and this fix can be turned into a general plug-and-play framework that works for prompt methods without creating new problems.

What would settle it

A controlled test that applies attention rectification only to the visual CLS token in a prompt-based method like MaPLe, with all other components unchanged. If performance on the four CDFSL benchmarks shows no improvement, or drops, the causal claim fails; a sketch of this protocol follows.
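A minimal harness for that test might look like the sketch below. All helper names are hypothetical (the paper's code is not yet released), and the four benchmark names are inferred from the figures and reference list rather than stated on this page.

```python
from statistics import mean, stdev

# Presumed BSCD-FSL targets, inferred from the figures/references.
BENCHMARKS = ["CropDisease", "EuroSAT", "ISIC2018", "ChestX"]

def settle_it(build_maple_clip, attach_cls_rectification, finetune_and_eval,
              shots=1, seeds=(0, 1, 2)):
    """Vary exactly one factor: EOS-guided rectification of the visual
    CLS token, applied to an otherwise unchanged MaPLe pipeline."""
    results = {}
    for rectify in (False, True):  # the only varied factor
        for dataset in BENCHMARKS:
            accs = []
            for seed in seeds:
                model = build_maple_clip(seed=seed)   # hypothetical helper
                if rectify:
                    attach_cls_rectification(model)   # visual CLS only
                accs.append(finetune_and_eval(model, dataset=dataset,
                                              shots=shots))
            results[(rectify, dataset)] = (mean(accs), stdev(accs))
    return results
```

If the rectified runs match the plain MaPLe runs within noise, the causal story fails; if they recover most of the LoRA gap, it stands.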

Figures

Figures reproduced from arXiv: 2605.11659 by Ruixuan Li, Yaze Zhao, Yicong Liu, Yixiong Zou, Yuhua Li.

Figure 1
Figure 1: (a) Among these fine-tuning methods, we find LoRA’s performance in the CDFSL task is consistently higher than that of others, which is opposite to the in-domain scenarios. (b) To understand this reversal, we uncover that the cause lies in the collapsed visual attention on cross-domain samples, preventing the model from focusing on visual regions relevant to the given class-related text, while LoRA can bett… view at source ↗
Figure 2
Figure 2: DOSNES visualization of embeddings of MaPLe and LoRA-CLIP. LoRA-CLIP (right) yields a smaller modality gap and tighter clusters. view at source ↗
Figure 4
Figure 4: The modality alignment and class separation metrics of the last-layer features evolve with training epochs for both LoRA-CLIP and MaPLe: (a) align_score (higher is better). (b) modality_gap (lower is better). (c) Calinski-Harabasz Index (higher is better). LoRA-CLIP is more effective than MaPLe at optimizing both modality alignment and class separation during fine-tuning for CDFSL tasks. view at source ↗
Figure 5
Figure 5: Attention scores of the CLS token on image tokens and the changes after fine-tuning. Before fine-tuning, the CLS token exhibits severe self-attention and neglects informative regions. LoRA markedly shifts attention toward class-relevant patches after fine-tuning, while MaPLe yields almost no change. view at source ↗
Figure 6
Figure 6: (a) Heatmaps of the visual CLS and textual EOS token’s attention to image. The EOS token demonstrates significantly stronger capability in capturing visual semantic information compared to the CLS token. (b) EOS-guided CLS Attention Rectification. view at source ↗
Figure 7
Figure 7: (a) Overview of our Semantic Probe framework, which revives in-domain fine-tuning methods for CDFSL. The EOS-guided Attention Rectification module is plugged into the final layers of CLIP to rectify attention, while the Balanced Alignment and Separation loss replaces the original contrastive loss to dynamically guide the model’s focus between modality alignment and class separation. (b) EAR module. view at source ↗
Figure 10
Figure 10: Samples from ISIC2018. view at source ↗
Figure 11
Figure 11: Samples from ChestX. view at source ↗
Figure 9
Figure 9: Samples from EuroSAT. view at source ↗
Figure 12
Figure 12: The alignment/aggregation metrics under different losses across training epochs. The modality alignment is only marginally affected by the choice of loss function, whereas the class aggregation improves significantly after the text-to-image term 𝐿𝑡→𝑖 is introduced. This suggests that CLIP’s InfoNCE loss primarily emphasizes class separation, with relatively weak constraints on modality alignment. view at source ↗
Figure 13
Figure 13: Performance of our Semantic Probe method under different configurations: (a) the steepness parameter 𝑘, (b) the decay threshold 𝑇, (c) the maximum initial value 𝑤, (d) the EOS-guided attention rectification weighting coefficient 𝛼. view at source ↗
Figure 14
Figure 14: (a) DOSNES visualization of features under various CLIP-based methods. (b) The CH index of visual features across CDFSL target datasets. view at source ↗
read the original abstract

Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover LoRA's superiority stems from rectifying the collapsed attention of visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find textual EOS token exhibit much better attention to visual samples, and CLIP's standard contrastive loss weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Codes will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript establishes multiple CLIP fine-tuning baselines for source-free cross-domain few-shot learning (CDFSL) and observes that adapter-based methods (e.g., LoRA) consistently outperform prompt-based methods (e.g., MaPLe), in contrast to in-domain behavior. Through analysis, it attributes LoRA's advantage to rectification of collapsed visual CLS-token attention (focusing on text-related regions and improving modality alignment and class separation), notes superior attention behavior of the textual EOS token, and identifies weak constraints from CLIP's contrastive loss. It proposes Semantic Probe, a plug-and-play attention-rectification framework applicable to both adapter and prompt paradigms, and reports state-of-the-art results on four CDFSL benchmarks.

Significance. If the proposed causal mechanism and transferability hold, the work would offer a practical, general-purpose way to revive strong in-domain fine-tuning techniques for cross-domain settings, addressing modality misalignment in CLIP-based CDFSL. The plug-and-play design and reported gains on multiple benchmarks could influence subsequent adapter/prompt research in few-shot and domain-adaptation literature.

major comments (3)
  1. [Analysis section] Analysis section (description of LoRA vs. prompt comparison): the claim that LoRA's superiority 'stems from rectifying the collapsed attention of visual CLS token' is presented as the key insight motivating Semantic Probe, yet the manuscript provides only observational comparisons without controlled ablations that isolate attention rectification from confounding factors such as parameter placement, optimization trajectory, or gradient flow differences between the two paradigms.
  2. [Semantic Probe framework] Semantic Probe framework description and experiments: while the method is claimed to be plug-and-play for prompt-based tuning, no ablation demonstrates that the attention rectification recovers the full performance gap observed with LoRA or that it avoids introducing new cross-domain failure modes (e.g., over-focusing on spurious text-related regions that hurt generalization on certain target domains).
  3. [Experiments] Experimental validation (four CDFSL benchmarks): the link between the proposed attention rectification and the reported SOTA gains is not supported by direct measurements (e.g., quantitative attention maps or modality-alignment metrics before/after Semantic Probe) that would confirm the mechanism rather than post-hoc correlation.
minor comments (3)
  1. The abstract states 'Codes will be released' but provides no repository link or supplementary material reference; this should be added for reproducibility.
  2. [Method] Notation for attention rectification (e.g., how the probe modifies CLS/EOS tokens) should be formalized with an equation or pseudocode to improve clarity.
  3. [Experiments] Ensure all baseline comparisons report mean and standard deviation over multiple random seeds or runs, and clarify whether the same hyperparameter search budget was used for adapters and prompts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have addressed each of the major comments below and will incorporate revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Analysis section] Analysis section (description of LoRA vs. prompt comparison): the claim that LoRA's superiority 'stems from rectifying the collapsed attention of visual CLS token' is presented as the key insight motivating Semantic Probe, yet the manuscript provides only observational comparisons without controlled ablations that isolate attention rectification from confounding factors such as parameter placement, optimization trajectory, or gradient flow differences between the two paradigms.

    Authors: We acknowledge that our current analysis is based on observational comparisons. To strengthen the causal claim, we will perform additional controlled ablations in the revision. These will include isolating the attention rectification effect by applying similar constraints to prompt methods and analyzing differences in optimization and gradient flow. We believe this will better isolate the contribution of attention rectification. revision: yes

  2. Referee: [Semantic Probe framework] Semantic Probe framework description and experiments: while the method is claimed to be plug-and-play for prompt-based tuning, no ablation demonstrates that the attention rectification recovers the full performance gap observed with LoRA or that it avoids introducing new cross-domain failure modes (e.g., over-focusing on spurious text-related regions that hurt generalization on certain target domains).

    Authors: We thank the referee for highlighting this. In the revised manuscript, we will add ablations that apply Semantic Probe to prompt-based methods and measure how much of the LoRA performance gap is recovered. Additionally, we will examine attention maps across different target domains to check for potential over-focusing on spurious regions and ensure no new failure modes are introduced. revision: yes

  3. Referee: [Experiments] Experimental validation (four CDFSL benchmarks): the link between the proposed attention rectification and the reported SOTA gains is not supported by direct measurements (e.g., quantitative attention maps or modality-alignment metrics before/after Semantic Probe) that would confirm the mechanism rather than post-hoc correlation.

    Authors: We agree that direct quantitative evidence is important to confirm the mechanism. We will include in the revision quantitative attention metrics (such as the proportion of attention on text-related regions) and modality alignment scores before and after Semantic Probe application. This will provide stronger support for the link between attention rectification and the observed performance improvements. revision: yes
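For readers who want to pin down what those measurements could be, here is a minimal sketch of two of the promised quantities: the modality gap (as in Liang et al., 2022, [21]; Figure 4b here) and the proportion of CLS attention mass on text-related patches. The masking rule for "text-related" is an assumption, not something the paper specifies; class separation can be checked with scikit-learn's Calinski-Harabasz score as in Figures 4c and 14b.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import calinski_harabasz_score

def modality_gap(img_emb, txt_emb):
    """L2 distance between the centroids of unit-normalized image and
    text embeddings (the modality gap of Liang et al., 2022); lower is
    better for alignment."""
    img_center = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (img_center - txt_center).norm().item()

def text_region_attention_mass(cls_attn, region_mask):
    """Fraction of the CLS token's patch attention that lands on patches
    marked text-related. `region_mask` is a boolean (batch, num_patches)
    tensor; deriving it (e.g., by thresholding EOS attention) is an
    assumption."""
    return (cls_attn * region_mask).sum(dim=-1).mean().item()

# Class separation of visual features, as in Fig. 4c / Fig. 14b:
# ch = calinski_harabasz_score(visual_feats.cpu().numpy(), labels.numpy())
```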

Circularity Check

0 steps flagged

No significant circularity; the derivation rests on empirical observations of attention patterns.

full rationale

The paper establishes baselines showing adapter methods outperform prompts in CDFSL, then reports observational analysis of CLS-token attention collapse in LoRA versus prompts, leading to the Semantic Probe framework. No equations, fitted parameters, or predictions reduce to inputs by construction. No self-citations serve as load-bearing uniqueness theorems or ansatzes. The central claims are grounded in benchmark experiments and attention visualizations rather than self-referential definitions or renamings of known results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work relies on standard assumptions about CLIP's architecture and contrastive training, with the new method introduced as an empirical fix.

axioms (1)
  • domain assumption: CLIP's standard contrastive loss weakly constrains modality alignment
    Stated as a finding that motivates the need for attention rectification.
invented entities (1)
  • Semantic Probe: no independent evidence
    purpose: Plug-and-play attention rectification framework to enhance both adapter and prompt fine-tuning in CDFSL
    Newly proposed method based on attention analysis.

pith-pipeline@v0.9.0 · 5508 in / 1202 out tokens · 39669 ms · 2026-05-13T01:20:05.197746+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 2 internal anchors

  1. [1]

    Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

    Codella, N., Rotemberg, V., Tschandl, P., Celebi, M.E., Dusza, S., Gutman, D., Helba, B., Kalloo, A., Liopyris, K., Marchetti, M., et al., 2019. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368

  2. [2]

    An image is worth 16x16 words: Transformers for image recognition at scale, in: ICLR 2021

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale, in: ICLR 2021

  3. [3]

    Mitigate the gap: Investigating approaches for improving cross-modal alignment in CLIP

    Eslami, S., de Melo, G., 2024. Mitigate the gap: Investigating approaches for improving cross-modal alignment in clip. arXiv preprint arXiv:2406.17639

  4. [4]

    Fu, Y., Xie, Y., Fu, Y., Jiang, Y., 2023. StyleAdv: Meta style adversarial training for cross-domain few-shot learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, IEEE. pp. 24575–24584

  5. [5]

    Clip-adapter: Better vision-language models with feature adapters

    Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y., 2024. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132, 581–595

  6. [6]

    Cyclip: Cyclic contrastive language-image pretraining

    Goel, S., Bansal, H., Bhatia, S., Rossi, R., Vinay, V., Grover, A., 2022. Cyclip: Cyclic contrastive language-image pretraining. Advances in Neural Information Processing Systems 35, 6704–6719

  7. [7]

    A broader study of cross-domain few-shot learning, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, Springer

    Guo, Y., Codella, N.C., Karlinsky, L., Codella, J.V., Smith, J.R., Saenko, K., Rosing, T., Feris, R., 2020. A broader study of cross-domain few-shot learning, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, Springer. pp. 124–141

  8. [8]

    Guo, Y., Gu, X., 2025. MMRL: multi-modal representation learning for vision-language models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, Computer Vision Foundation / IEEE. pp. 25015–25025

  9. [9]

    Helber, P., Bischke, B., Dengel, A., Borth, D., 2019. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12, 2217–2226

  10. [10]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., 2022a. Lora: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net

  11. [11]

    Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference, in: CVPR 2022, IEEE

    Hu, S.X., Li, D., Stühmer, J., Kim, M., Hospedales, T.M., 2022b. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference, in: CVPR 2022, IEEE. pp. 9058–9067

  12. [12]

    Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models, in: Bouamor, H., Pino, J., Bali, K

    Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E., Bing, L., Xu, X., Poria, S., Lee, R.K., 2023. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models, in: Bouamor, H., Pino, J., Bali, K. (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Associat...

  13. [13]

    Huang, Y., Shakeri, F., Dolz, J., Boudiaf, M., Bahig, H., Ayed, I.B.,

  14. [14]

    Lp++: A surprisingly strong linear probe for few-shot clip, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  15. [15]

    Visual prompt tuning, in: European conference on computer vision, Springer

    Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N., 2022. Visual prompt tuning, in: European conference on computer vision, Springer. pp. 709–727

  16. [16]

    Maple: Multi-modal prompt learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S., 2023a. Maple: Multi-modal prompt learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19113–19122

  17. [17]

    Self-regulating prompts: Foundational model adaptation without forgetting, in: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, IEEE

    Khattak, M.U., Wasim, S.T., Naseer, M., Khan, S., Yang, M., Khan, F.S., 2023b. Self-regulating prompts: Foundational model adaptation without forgetting, in: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, IEEE. pp. 15144–15154

  18. [18]

    Li, S., Liu, F., Hao, Z., Wang, X., Li, L., Liu, X., Chen, P., Ma, W.,

  19. [19]

    Logits deconfusion with CLIP for few-shot learning

    Logits deconfusion with CLIP for few-shot learning, in: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 25411–25421

  20. [20]

    Boosting the generalization capability in cross-domain few-shot learning via noise-enhanced supervised autoencoder

    Liang, H., Zhang, Q., Dai, P., Lu, J., 2021. Boosting the generalization capability in cross-domain few-shot learning via noise-enhanced supervised autoencoder, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9424–9434

  21. [21]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y., 2022. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems 35, 17612–17625

  22. [22]

    The Llama 3 Herd of Models

    Llama Team, 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. URL: https://arxiv.org/abs/2407.21783, doi:10.48550/arXiv.2407.21783

  23. [23]

    Doubly stochastic neighbor embedding on spheres

    Lu, Y., Corander, J., Yang, Z., 2016. Doubly stochastic neighbor embedding on spheres. arXiv preprint arXiv:1609.01977

  24. [24]

    Reconstruction target matters in masked image modeling for cross-domain few-shot learning, in: Walsh, T., Shah, J., Kolter, Z

    Ma, R., Zou, Y., Li, Y., Li, R., 2025. Reconstruction target matters in masked image modeling for cross-domain few-shot learning, in: Walsh, T., Shah, J., Kolter, Z. (Eds.), AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, AAAI Press. pp. 19305–19313

  25. [25]

    Using deep learning for image-based plant disease detection

    Mohanty, S.P., Hughes, D.P., Salathé, M., 2016. Using deep learning for image-based plant disease detection. Frontiers in plant science 7, 215232

  26. [26]

    Representation Learning with Contrastive Predictive Coding

    van den Oord, A., Li, Y., Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. URL: https://arxiv.org/abs/1807.03748

  27. [27]

    Pratt, S.M., Covert, I., Liu, R., Farhadi, A., 2023. What does a platypus look like? generating customized prompts for zero-shot image classification, in: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, IEEE. pp. 15645–15655

  28. [28]

    Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR. pp. 8748–8763

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al., 2021. Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR. pp. 8748–8763

  29. [29]

    Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning

    Schrodi, S., Hoffmann, D.T., Argus, M., Fischer, V., Brox, T., 2024. Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning. arXiv preprint arXiv:2404.07983

  30. [30]

    Tang, Y., Lin, Z., Wang, Q., Zhu, P., Hu, Q., 2024. Amu-tuning: Effective logit bias for clip-based few-shot learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, IEEE. pp. 23323–23333

  31. [31]

    Cross-domain few-shot classification via learned feature-wise transformation

    Tseng, H.Y., Lee, H.Y., Huang, J.B., Yang, M.H., 2020. Cross-domain few-shot classification via learned feature-wise transformation. arXiv preprint arXiv:2001.08735

  32. [32]

    On isotropy of multimodal embeddings

    Tyshchuk, K., Karpikova, P., Spiridonov, A., Prutianova, A., Razzhigaev, A., Panchenko, A., 2023. On isotropy of multimodal embeddings. Information 14, 392

  33. [33]

    Masked embedding modeling with rapid domain adjustment for few-shot image classification

    Walsh, R., Osman, I.I., Shehata, M.S., 2023. Masked embedding modeling with rapid domain adjustment for few-shot image classification. IEEE Trans. Image Process., 4907–4920

  34. [34]

    Cross-domain few-shot classification via adversarial task augmentation

    Wang, H., Deng, Z.H., 2021. Cross-domain few-shot classification via adversarial task augmentation. arXiv preprint arXiv:2104.14385

  35. [35]

    Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.,

  36. [36]

    Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society. pp. 3462–3471

  37. [37]

    Flair: Vlm with fine-grained language-informed image representations

    Xiao, R., Kim, S., Georgescu, M.I., Akata, Z., Alaniz, S., 2024. Flair: Vlm with fine-grained language-informed image representations. arXiv preprint arXiv:2412.03561

  38. [38]

    Enhancing information maximization with distance-aware contrastive learning for source-free cross-domain few-shot learning

    Xu, H., Liu, L., Zhi, S., Fu, S., Su, Z., Cheng, M., Liu, Y., 2024a. Enhancing information maximization with distance-aware contrastive learning for source-free cross-domain few-shot learning. IEEE Trans. Image Process., 2058–2073

  39. [39]

    Step-wise distribution alignment guided style prompt tuning for source-free cross-domain few-shot learning

    Xu, H., Liu, Y., Liu, L., Zhi, S., Sun, S., Liu, T., Cheng, M.M., 2024b. Step-wise distribution alignment guided style prompt tuning for source-free cross-domain few-shot learning. arXiv preprint arXiv:2411.10070. URL: https://arxiv.org/abs/2411.10070

  40. [40]

    Mma: Multi-modal adapter for vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Yang, L., Zhang, R.Y., Wang, Y., Xie, X., 2024. Mma: Multi-modal adapter for vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23826–23837

  41. [41]

    Yang, Y., Deng, J., Li, W., Duan, L., 2025. Resclip: Residual attention for training-free dense vision-language inference, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, Computer Vision Foundation / IEEE. pp. 29968–29978

  42. [42]

    Yazdanpanah, M., Moradi, P., 2022. Visual domain bridge: A source-free domain adaptation for cross-domain few-shot learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2022, New Orleans, LA, USA, June 19-20, 2022, IEEE. pp. 2867–2876. URL: https://doi.org/10.1109/CVPRW56347.2022.00324, doi:10.1109/CVPRW56347.2022.00324

  43. [43]

    Zanella, M., Ayed, I.B., 2024. Low-rank few-shot adaptation of vision-language models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024, IEEE. pp. 1593–1603

  44. [44]

    Sigmoid loss for language image pre-training, in: Proceedings of the IEEE/CVF international conference on computer vision

    Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L., 2023. Sigmoid loss for language image pre-training, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 11975–11986

  45. [45]

    Tip-adapter: Training-free clip-adapter for better vision-language modeling

    Zhang, R., Fang, R., Zhang, W., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H., 2021. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930

  46. [46]

    Tip-adapter: Training-free adaption of CLIP for few-shot classification, in: Computer Vision - ECCV 2022 - 17th European Conference

    Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H., 2022. Tip-adapter: Training-free adaption of CLIP for few-shot classification, in: Computer Vision - ECCV 2022 - 17th European Conference, pp. 493–510

  47. [47]

    Revisiting prototypical network for cross domain few-shot learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhou, F., Wang, P., Zhang, L., Wei, W., Zhang, Y., 2023. Revisiting prototypical network for cross domain few-shot learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 20061–20070

  48. [48]

    Learning to prompt for vision-language models

    Zhou, K., Yang, J., Loy, C.C., Liu, Z., 2022. Learning to prompt for vision-language models. International Journal of Computer Vision 130, 2337–2348

  49. [49]

    Prompt as free lunch: Enhancing diversity in source-free cross-domain few-shot learning through semantic-guided prompting

    Zhuo, L., Wang, Z., Fu, Y., Qian, T., 2024. Prompt as free lunch: Enhancing diversity in source-free cross-domain few-shot learning through semantic-guided prompting. arXiv preprint arXiv:2412.00767. URL:https://arxiv.org/abs/2412.00767

  50. [50]

    Flatten long-range loss landscapes for cross-domain few-shot learning, in: CVPR 2024, IEEE

    Zou, Y., Liu, Y., Hu, Y., Li, Y., Li, R., 2024a. Flatten long-range loss landscapes for cross-domain few-shot learning, in: CVPR 2024, IEEE. pp. 23575–23584

  51. [51]

    Attention temperature matters in vit-based cross-domain few-shot learning

    Zou, Y., Ma, R., Li, Y., Li, R., 2024b. Attention temperature matters in vit-based cross-domain few-shot learning. Advances in Neural Information Processing Systems 37, 116332–116354

  52. [52]

    A closer look at the CLS token for cross-domain few-shot learning, in: NeurIPS 2024

    Zou, Y., Yi, S., Li, Y., Li, R., 2024c. A closer look at the CLS token for cross-domain few-shot learning, in: NeurIPS 2024
