pith. machine review for the scientific record.

arxiv: 2605.11659 · v1 · submitted 2026-05-12 · 💻 cs.CV · cs.AI

Recognition: 2 theorem links

· Lean Theorem

Reviving In-domain Fine-tuning Methods for Source-Free Cross-domain Few-shot Learning

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 01:20 UTC · model grok-4.3

classification 💻 cs.CV cs.AI
keywords cross-domain few-shot learning · CLIP fine-tuning · attention rectification · adapter methods · prompt tuning · modality alignment · semantic probe · vision-language models

The pith

Rectifying collapsed attention in CLIP makes prompt-based fine-tuning competitive again for cross-domain few-shot learning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes baselines showing adapter methods like LoRA outperform prompt methods like MaPLe in cross-domain few-shot learning on CLIP, reversing the pattern seen in standard in-domain tasks. Analysis reveals LoRA succeeds by correcting the collapsed attention of the visual CLS token, which improves how visual features align with text and separate classes by focusing on relevant regions. The authors introduce Semantic Probe, a plug-and-play framework that applies similar attention fixes to both adapter and prompt approaches. Experiments across four CDFSL benchmarks confirm the approach reaches state-of-the-art results while benefiting both fine-tuning styles. The work highlights that attention patterns and modality alignment constraints are key to adapting pretrained vision-language models under limited target samples.

Core claim

LoRA's superiority in CDFSL stems from rectifying the collapsed attention of the visual CLS token, which enhances modality alignment and class separation by directing focus to text-related visual regions. The textual EOS token attends to visual samples more strongly than the visual CLS token does, while CLIP's standard contrastive loss provides only weak constraints on alignment. Semantic Probe is introduced as a general attention rectification framework that plugs into both adapter- and prompt-based methods to restore these benefits, delivering state-of-the-art performance on four CDFSL benchmarks.

What carries the argument

Semantic Probe, a plug-and-play attention rectification framework that adjusts the attention of visual CLS tokens in fine-tuning methods to restore focus on text-related regions and strengthen modality alignment.
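The page reproduces no equations for this module (the EAR block of Figure 7), so the following is a minimal sketch, assuming the rectification is a convex blend of the visual CLS attention with a text-derived attention map over the same patches. `alpha` stands in for the weighting coefficient 𝛼 swept in Figure 13; the function names and the similarity-softmax construction of the EOS attention are illustrative assumptions, not the paper's definitions.

```python
import torch
import torch.nn.functional as F

def eos_attention_over_patches(patch_feats, eos_feat, temperature=0.07):
    """One plausible way to obtain 'EOS attention' over image patches:
    a softmax over cosine similarities between the textual EOS embedding
    and per-patch visual embeddings, both projected to the shared space.
    (Assumed construction; the paper may define this differently.)"""
    patch_feats = F.normalize(patch_feats, dim=-1)  # (batch, P, d)
    eos_feat = F.normalize(eos_feat, dim=-1)        # (batch, d)
    sims = torch.einsum("bpd,bd->bp", patch_feats, eos_feat)
    return F.softmax(sims / temperature, dim=-1)    # (batch, P)

def eos_guided_cls_rectification(cls_attn, eos_attn, alpha=0.5):
    """Blend the visual CLS token's attention over patches with the
    text-informed pattern, then renormalize to a valid distribution."""
    rectified = (1.0 - alpha) * cls_attn + alpha * eos_attn
    return rectified / rectified.sum(dim=-1, keepdim=True)
```

Because the blend only edits attention weights in the final layers, it can in principle be attached to a LoRA-tuned or a prompt-tuned CLIP alike, which is what "plug-and-play" would require.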

If this is right

  • Both adapter and prompt fine-tuning methods become viable for CDFSL once attention collapse is addressed.
  • Focusing on text-related visual regions improves class separation in low-data cross-domain settings.
  • Textual EOS tokens can serve as a stronger anchor for visual alignment than CLS tokens alone.
  • Standard contrastive loss in CLIP needs supplementation to better enforce modality alignment across domains (the two directional loss terms are sketched after this list).
  • The same rectification principle scales to multiple fine-tuning paradigms without architecture-specific redesign.
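On the last bullet: CLIP's pretraining objective is a symmetric InfoNCE loss, and Figure 12 isolates the text-to-image term 𝐿𝑡→𝑖 as the one whose introduction improves class aggregation. Below is a minimal sketch of the two directional terms, written so each can be weighted separately; the reweighting is illustrative and is not the paper's Balanced Alignment and Separation loss.

```python
import torch
import torch.nn.functional as F

def directional_infonce(img_emb, txt_emb, temperature=0.07):
    """CLIP-style contrastive loss split into its two directions:
    L_{i->t} (image-to-text) and L_{t->i} (text-to-image)."""
    img = F.normalize(img_emb, dim=-1)      # (N, d)
    txt = F.normalize(txt_emb, dim=-1)      # (N, d)
    logits = img @ txt.t() / temperature    # (N, N) pairwise similarities
    targets = torch.arange(img.size(0), device=img.device)
    loss_i2t = F.cross_entropy(logits, targets)      # image -> text
    loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image
    return loss_i2t, loss_t2i

# Illustrative reweighting (weights are hypothetical, not from the paper):
# l_i2t, l_t2i = directional_infonce(img_emb, txt_emb)
# total = 0.5 * l_i2t + 0.5 * l_t2i  # standard CLIP averages the two
```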

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same attention analysis could apply to other vision-language models that use CLS tokens and contrastive pretraining.
  • Domains with strong visual-text mismatch, such as medical or satellite imagery, might benefit most from explicit rectification.
  • Combining Semantic Probe with stronger alignment losses could yield further gains beyond current SOTA.
  • Attention collapse may explain why some pretrained models fail to adapt quickly in few-shot regimes.

Load-bearing premise

The superiority of LoRA in cross-domain settings comes specifically from rectifying collapsed visual CLS token attention, and this fix can be turned into a general plug-and-play framework that works for prompt methods without creating new problems.

What would settle it

A controlled test that applies attention rectification only to the visual CLS token in a prompt-based method like MaPLe, with all other components unchanged. If performance on the four CDFSL benchmarks shows no improvement, or drops, the causal claim fails; a sketch of this protocol follows.
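A minimal harness for that test might look like the sketch below. All helper names are hypothetical (the paper's code is not yet released), and the four benchmark names are inferred from the figures and reference list rather than stated on this page.

```python
from statistics import mean, stdev

# Presumed BSCD-FSL targets, inferred from the figures/references.
BENCHMARKS = ["CropDisease", "EuroSAT", "ISIC2018", "ChestX"]

def settle_it(build_maple_clip, attach_cls_rectification, finetune_and_eval,
              shots=1, seeds=(0, 1, 2)):
    """Vary exactly one factor: EOS-guided rectification of the visual
    CLS token, applied to an otherwise unchanged MaPLe pipeline."""
    results = {}
    for rectify in (False, True):  # the only varied factor
        for dataset in BENCHMARKS:
            accs = []
            for seed in seeds:
                model = build_maple_clip(seed=seed)   # hypothetical helper
                if rectify:
                    attach_cls_rectification(model)   # visual CLS only
                accs.append(finetune_and_eval(model, dataset=dataset,
                                              shots=shots))
            results[(rectify, dataset)] = (mean(accs), stdev(accs))
    return results
```

If the rectified runs match the plain MaPLe runs within noise, the causal story fails; if they recover most of the LoRA gap, it stands.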

Figures

Figures reproduced from arXiv: 2605.11659 by Ruixuan Li, Yaze Zhao, Yicong Liu, Yixiong Zou, Yuhua Li.

Figure 1
Figure 1: (a) Among these fine-tuning methods, we find LoRA’s performance in the CDFSL task is consistently higher than that of others, which is opposite to the in-domain scenarios. (b) To understand this reversal, we uncover that the cause lies in the collapsed visual attention on cross-domain samples, preventing the model from focusing on visual regions relevant to the given class-related text, while LoRA can bett… view at source ↗
Figure 2
Figure 2: DOSNES visualization of embeddings of MaPLe and LoRA-CLIP. LoRA-CLIP (right) yields a smaller modality gap and tighter clusters. view at source ↗
Figure 4
Figure 4: The modality alignment and class separation metrics of the last-layer features evolve with training epochs for both LoRA-CLIP and MaPLe: (a) align_score (higher is better). (b) modality_gap (lower is better). (c) Calinski-Harabasz Index (higher is better). LoRA-CLIP is more effective than MaPLe at optimizing both modality alignment and class separation during fine-tuning for CDFSL tasks. view at source ↗
Figure 5
Figure 5: Attention scores of the CLS token on image tokens and the changes after fine-tuning. Before fine-tuning, the CLS token exhibits severe self-attention and neglects informative regions. LoRA markedly shifts attention toward class-relevant patches after fine-tuning, while MaPLe yields almost no change. view at source ↗
Figure 6
Figure 6: (a) Heatmaps of the visual CLS and textual EOS token’s attention to image. The EOS token demonstrates significantly stronger capability in capturing visual semantic information compared to the CLS token. (b) EOS-guided CLS Attention Rectification. view at source ↗
Figure 7
Figure 7: (a) Overview of our Semantic Probe framework, which revives in-domain fine-tuning methods for CDFSL. The EOS-guided Attention Rectification module is plugged into the final layers of CLIP to rectify attention, while the Balanced Alignment and Separation loss replaces the original contrastive loss to dynamically guide the model’s focus between modality alignment and class separation. (b) EAR module. view at source ↗
Figure 10
Figure 10: Samples from ISIC2018. view at source ↗
Figure 11
Figure 11: Samples from ChestX. view at source ↗
Figure 9
Figure 9: Samples from EuroSAT. view at source ↗
Figure 12
Figure 12: The alignment/aggregation metrics under different losses across training epochs. The modality alignment is only marginally affected by the choice of loss function, whereas the class aggregation improves significantly after the text-to-image term 𝐿𝑡→𝑖 is introduced. This suggests that CLIP’s InfoNCE loss primarily emphasizes class separation, with relatively weak constraints on modality alignment. view at source ↗
Figure 13
Figure 13: Performance of our Semantic Probe method under different configurations: (a) the steepness parameter 𝑘, (b) the decay threshold 𝑇, (c) the maximum initial value 𝑤, (d) the EOS-guided attention rectification weighting coefficient 𝛼. view at source ↗
Figure 14
Figure 14: (a) DOSNES visualization of features under various CLIP-based methods. (b) The CH index of visual features across CDFSL target datasets. view at source ↗
read the original abstract

Cross-Domain Few-Shot Learning (CDFSL) aims to adapt large-scale pretrained models to specialized target domains with limited samples, yet the few-shot fine-tuning of vision-language models like CLIP remains underexplored. By establishing multiple fine-tuning baselines of CLIP for CDFSL, we find adapter-based methods (e.g., LoRA) consistently outperform prompt-based ones (e.g., MaPLe), contrary to in-domain scenarios. To make those effective in-domain methods competitive again in CDFSL, we analyze this phenomenon and discover LoRA's superiority stems from rectifying the collapsed attention of visual CLS token, enhancing modality alignment and class separation by focusing on text-related visual regions. Further, we find textual EOS token exhibit much better attention to visual samples, and CLIP's standard contrastive loss weakly constrains modality alignment. Based on these insights, we propose Semantic Probe, a plug-and-play attention rectification framework for both adapter- and prompt-based methods. Extensive experiments on four CDFSL benchmarks validate our rationale, achieving state-of-the-art performance and benefiting both fine-tuning paradigms. Codes will be released.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 3 minor

Summary. The manuscript establishes multiple CLIP fine-tuning baselines for source-free cross-domain few-shot learning (CDFSL) and observes that adapter-based methods (e.g., LoRA) consistently outperform prompt-based methods (e.g., MaPLe), in contrast to in-domain behavior. Through analysis, it attributes LoRA's advantage to rectification of collapsed visual CLS-token attention (focusing on text-related regions and improving modality alignment and class separation), notes superior attention behavior of the textual EOS token, and identifies weak constraints from CLIP's contrastive loss. It proposes Semantic Probe, a plug-and-play attention-rectification framework applicable to both adapter and prompt paradigms, and reports state-of-the-art results on four CDFSL benchmarks.

Significance. If the proposed causal mechanism and transferability hold, the work would offer a practical, general-purpose way to revive strong in-domain fine-tuning techniques for cross-domain settings, addressing modality misalignment in CLIP-based CDFSL. The plug-and-play design and reported gains on multiple benchmarks could influence subsequent adapter/prompt research in few-shot and domain-adaptation literature.

major comments (3)
  1. [Analysis section] Analysis section (description of LoRA vs. prompt comparison): the claim that LoRA's superiority 'stems from rectifying the collapsed attention of visual CLS token' is presented as the key insight motivating Semantic Probe, yet the manuscript provides only observational comparisons without controlled ablations that isolate attention rectification from confounding factors such as parameter placement, optimization trajectory, or gradient flow differences between the two paradigms.
  2. [Semantic Probe framework] Semantic Probe framework description and experiments: while the method is claimed to be plug-and-play for prompt-based tuning, no ablation demonstrates that the attention rectification recovers the full performance gap observed with LoRA or that it avoids introducing new cross-domain failure modes (e.g., over-focusing on spurious text-related regions that hurt generalization on certain target domains).
  3. [Experiments] Experimental validation (four CDFSL benchmarks): the link between the proposed attention rectification and the reported SOTA gains is not supported by direct measurements (e.g., quantitative attention maps or modality-alignment metrics before/after Semantic Probe) that would confirm the mechanism rather than post-hoc correlation.
minor comments (3)
  1. The abstract states 'Codes will be released' but provides no repository link or supplementary material reference; this should be added for reproducibility.
  2. [Method] Notation for attention rectification (e.g., how the probe modifies CLS/EOS tokens) should be formalized with an equation or pseudocode to improve clarity.
  3. [Experiments] Ensure all baseline comparisons report mean and standard deviation over multiple random seeds or runs, and clarify whether the same hyperparameter search budget was used for adapters and prompts.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback on our manuscript. We have addressed each of the major comments below and will incorporate revisions to strengthen the paper.

read point-by-point responses
  1. Referee: [Analysis section] Analysis section (description of LoRA vs. prompt comparison): the claim that LoRA's superiority 'stems from rectifying the collapsed attention of visual CLS token' is presented as the key insight motivating Semantic Probe, yet the manuscript provides only observational comparisons without controlled ablations that isolate attention rectification from confounding factors such as parameter placement, optimization trajectory, or gradient flow differences between the two paradigms.

    Authors: We acknowledge that our current analysis is based on observational comparisons. To strengthen the causal claim, we will perform additional controlled ablations in the revision. These will include isolating the attention rectification effect by applying similar constraints to prompt methods and analyzing differences in optimization and gradient flow. We believe this will better isolate the contribution of attention rectification. revision: yes

  2. Referee: [Semantic Probe framework] Semantic Probe framework description and experiments: while the method is claimed to be plug-and-play for prompt-based tuning, no ablation demonstrates that the attention rectification recovers the full performance gap observed with LoRA or that it avoids introducing new cross-domain failure modes (e.g., over-focusing on spurious text-related regions that hurt generalization on certain target domains).

    Authors: We thank the referee for highlighting this. In the revised manuscript, we will add ablations that apply Semantic Probe to prompt-based methods and measure how much of the LoRA performance gap is recovered. Additionally, we will examine attention maps across different target domains to check for potential over-focusing on spurious regions and ensure no new failure modes are introduced. revision: yes

  3. Referee: [Experiments] Experimental validation (four CDFSL benchmarks): the link between the proposed attention rectification and the reported SOTA gains is not supported by direct measurements (e.g., quantitative attention maps or modality-alignment metrics before/after Semantic Probe) that would confirm the mechanism rather than post-hoc correlation.

    Authors: We agree that direct quantitative evidence is important to confirm the mechanism. We will include in the revision quantitative attention metrics (such as the proportion of attention on text-related regions) and modality alignment scores before and after Semantic Probe application. This will provide stronger support for the link between attention rectification and the observed performance improvements. revision: yes
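For readers who want to pin down what those measurements could be, here is a minimal sketch of two of the promised quantities: the modality gap (as in Liang et al., 2022, [21]; Figure 4b here) and the proportion of CLS attention mass on text-related patches. The masking rule for "text-related" is an assumption, not something the paper specifies; class separation can be checked with scikit-learn's Calinski-Harabasz score as in Figures 4c and 14b.

```python
import torch
import torch.nn.functional as F
from sklearn.metrics import calinski_harabasz_score

def modality_gap(img_emb, txt_emb):
    """L2 distance between the centroids of unit-normalized image and
    text embeddings (the modality gap of Liang et al., 2022); lower is
    better for alignment."""
    img_center = F.normalize(img_emb, dim=-1).mean(dim=0)
    txt_center = F.normalize(txt_emb, dim=-1).mean(dim=0)
    return (img_center - txt_center).norm().item()

def text_region_attention_mass(cls_attn, region_mask):
    """Fraction of the CLS token's patch attention that lands on patches
    marked text-related. `region_mask` is a boolean (batch, num_patches)
    tensor; deriving it (e.g., by thresholding EOS attention) is an
    assumption."""
    return (cls_attn * region_mask).sum(dim=-1).mean().item()

# Class separation of visual features, as in Fig. 4c / Fig. 14b:
# ch = calinski_harabasz_score(visual_feats.cpu().numpy(), labels.numpy())
```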

Circularity Check

0 steps flagged

No significant circularity; the derivation rests on empirical observations of attention patterns.

full rationale

The paper establishes baselines showing adapter methods outperform prompts in CDFSL, then reports observational analysis of CLS-token attention collapse in LoRA versus prompts, leading to the Semantic Probe framework. No equations, fitted parameters, or predictions reduce to inputs by construction. No self-citations serve as load-bearing uniqueness theorems or ansatzes. The central claims are grounded in benchmark experiments and attention visualizations rather than self-referential definitions or renamings of known results. The derivation chain is self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work relies on standard assumptions about CLIP's architecture and contrastive training, with the new method introduced as an empirical fix.

axioms (1)
  • domain assumption: CLIP's standard contrastive loss weakly constrains modality alignment
    Stated as a finding that motivates the need for attention rectification.
invented entities (1)
  • Semantic Probe: no independent evidence
    purpose: Plug-and-play attention rectification framework to enhance both adapter and prompt fine-tuning in CDFSL
    Newly proposed method based on attention analysis.

pith-pipeline@v0.9.0 · 5508 in / 1202 out tokens · 39669 ms · 2026-05-13T01:20:05.197746+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

53 extracted references · 53 canonical work pages · 2 internal anchors

  1. [1]

    Skin Lesion Analysis Toward Melanoma Detection 2018: A Challenge Hosted by the International Skin Imaging Collaboration (ISIC)

    Codella, N., Rotemberg, V., Tschandl, P., Celebi, M.E., Dusza, S., Gutman, D., Helba, B., Kalloo, A., Liopyris, K., Marchetti, M., et al., 2019. Skin lesion analysis toward melanoma detection 2018: A challenge hosted by the international skin imaging collaboration (isic). arXiv preprint arXiv:1902.03368

  2. [2]

    An image is worth 16x16 words: Transformers for image recognition at scale, in: ICLR 2021

    Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021. An image is worth 16x16 words: Transformers for image recognition at scale, in: ICLR 2021

  3. [3]

    Mitigate the gap: Investigating approaches for improving cross-modal alignment in CLIP

    Eslami, S., de Melo, G., 2024. Mitigate the gap: Investigating approaches for improving cross-modal alignment in clip. arXiv preprint arXiv:2406.17639

  4. [4]

    Fu, Y., Xie, Y., Fu, Y., Jiang, Y., 2023. StyleAdv: Meta style adversarial training for cross-domain few-shot learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2023, Vancouver, BC, Canada, June 17-24, 2023, IEEE. pp. 24575–24584

  5. [5]

    Clip-adapter: Better vision-language models with feature adapters

    Gao, P., Geng, S., Zhang, R., Ma, T., Fang, R., Zhang, Y., Li, H., Qiao, Y., 2024. Clip-adapter: Better vision-language models with feature adapters. International Journal of Computer Vision 132, 581–595

  6. [6]

    Cyclip: Cyclic contrastive language-image pretraining

    Goel, S., Bansal, H., Bhatia, S., Rossi, R., Vinay, V., Grover, A., 2022. Cyclip: Cyclic contrastive language-image pretraining. Advances in Neural Information Processing Systems 35, 6704–6719

  7. [7]

    A broader study of cross-domain few-shot learning, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, Springer

    Guo, Y., Codella, N.C., Karlinsky, L., Codella, J.V., Smith, J.R., Saenko, K., Rosing, T., Feris, R., 2020. A broader study of cross-domain few-shot learning, in: Computer Vision – ECCV 2020: 16th European Conference, Glasgow, UK, August 23–28, 2020, Proceedings, Part XXVII 16, Springer. pp. 124–141

  8. [8]

    Guo, Y., Gu, X., 2025. MMRL: multi-modal representation learning for vision-language models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, Computer Vision Foundation / IEEE. pp. 25015–25025

  9. [9]

    Helber, P., Bischke, B., Dengel, A., Borth, D., 2019. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing 12, 2217–2226

  10. [10]

    Hu, E.J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W., 2022a. Lora: Low-rank adaptation of large language models, in: The Tenth International Conference on Learning Representations, ICLR 2022, Virtual Event, April 25-29, 2022, OpenReview.net

  11. [11]

    Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference, in: CVPR 2022, IEEE

    Hu, S.X., Li, D., Stühmer, J., Kim, M., Hospedales, T.M., 2022b. Pushing the limits of simple pipelines for few-shot learning: External data and fine-tuning make a difference, in: CVPR 2022, IEEE. pp. 9058–9067

  12. [12]

    Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models, in: Bouamor, H., Pino, J., Bali, K

    Hu, Z., Wang, L., Lan, Y., Xu, W., Lim, E., Bing, L., Xu, X., Poria, S., Lee, R.K., 2023. Llm-adapters: An adapter family for parameter-efficient fine-tuning of large language models, in: Bouamor, H., Pino, J., Bali, K. (Eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, Associat...

  13. [13]

    Huang, Y., Shakeri, F., Dolz, J., Boudiaf, M., Bahig, H., Ayed, I.B.,

  14. [14]

    Lp++: A surprisingly strong linear probe for few-shot clip, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

  15. [15]

    Visual prompt tuning, in: European conference on computer vision, Springer

    Jia, M., Tang, L., Chen, B.C., Cardie, C., Belongie, S., Hariharan, B., Lim, S.N., 2022. Visual prompt tuning, in: European conference on computer vision, Springer. pp. 709–727

  16. [16]

    Maple: Multi-modal prompt learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Khattak, M.U., Rasheed, H., Maaz, M., Khan, S., Khan, F.S., 2023a. Maple: Multi-modal prompt learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 19113–19122

  17. [17]

    Self-regulating prompts: Foundational model adaptation without forgetting, in: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, IEEE

    Khattak, M.U., Wasim, S.T., Naseer, M., Khan, S., Yang, M., Khan, F.S., 2023b. Self-regulating prompts: Foundational model adaptation without forgetting, in: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, IEEE. pp. 15144–15154

  18. [18]

    Li, S., Liu, F., Hao, Z., Wang, X., Li, L., Liu, X., Chen, P., Ma, W.,

  19. [19]

    Logits deconfusion with CLIP for few-shot learning

    Logits deconfusion with CLIP for few-shot learning, in: Proceedings of the Computer Vision and Pattern Recognition Conference (CVPR), pp. 25411–25421

  20. [20]

    Boosting the generalization capability in cross-domain few-shot learning via noise-enhanced supervised autoencoder

    Liang, H., Zhang, Q., Dai, P., Lu, J., 2021. Boosting the generalization capability in cross-domain few-shot learning via noise-enhanced supervised autoencoder, in: Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 9424–9434

  21. [21]

    Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning

    Liang, V.W., Zhang, Y., Kwon, Y., Yeung, S., Zou, J.Y., 2022. Mind the gap: Understanding the modality gap in multi-modal contrastive representation learning. Advances in Neural Information Processing Systems 35, 17612–17625

  22. [22]

    The Llama 3 Herd of Models

    Llama Team, 2024. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783. URL: https://arxiv.org/abs/2407.21783, doi:10.48550/arXiv.2407.21783

  23. [23]

    Doubly stochastic neighbor embedding on spheres

    Lu, Y., Corander, J., Yang, Z., 2016. Doubly stochastic neighbor embedding on spheres. arXiv preprint arXiv:1609.01977

  24. [24]

    Reconstruction target matters in masked image modeling for cross-domain few-shot learning, in: Walsh, T., Shah, J., Kolter, Z

    Ma, R., Zou, Y., Li, Y., Li, R., 2025. Reconstruction target matters in masked image modeling for cross-domain few-shot learning, in: Walsh, T., Shah, J., Kolter, Z. (Eds.), AAAI-25, Sponsored by the Association for the Advancement of Artificial Intelligence, February 25 - March 4, 2025, Philadelphia, PA, USA, AAAI Press. pp. 19305–19313

  25. [25]

    Using deep learning for image-based plant disease detection

    Mohanty, S.P., Hughes, D.P., Salathé, M., 2016. Using deep learning for image-based plant disease detection. Frontiers in plant science 7, 215232

  26. [26]

    Representation Learning with Contrastive Predictive Coding

    van den Oord, A., Li, Y., Vinyals, O., 2018. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748. URL: https://arxiv.org/abs/1807.03748

  27. [27]

    Pratt, S.M., Covert, I., Liu, R., Farhadi, A., 2023. What does a platypus look like? generating customized prompts for zero-shot image classification, in: IEEE/CVF International Conference on Computer Vision, ICCV 2023, Paris, France, October 1-6, 2023, IEEE. pp. 15645–15655

  28. [28]

    Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR. pp. 8748–8763

    Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al., 2021. Learning transferable visual models from natural language supervision, in: International Conference on Machine Learning, PMLR. pp. 8748–8763

  29. [29]

    Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning

    Schrodi, S., Hoffmann, D.T., Argus, M., Fischer, V., Brox, T., 2024. Two effects, one trigger: on the modality gap, object bias, and information imbalance in contrastive vision-language representation learning. arXiv preprint arXiv:2404.07983

  30. [30]

    Tang, Y., Lin, Z., Wang, Q., Zhu, P., Hu, Q., 2024. Amu-tuning: Effective logit bias for clip-based few-shot learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024, Seattle, WA, USA, June 16-22, 2024, IEEE. pp. 23323–23333

  31. [31]

    Cross-domain few-shot classification via learned feature-wise transformation

    Tseng, H.Y., Lee, H.Y., Huang, J.B., Yang, M.H., 2020. Cross-domain few-shot classification via learned feature-wise transformation. arXiv preprint arXiv:2001.08735

  32. [32]

    On isotropy of multimodal embeddings

    Tyshchuk, K., Karpikova, P., Spiridonov, A., Prutianova, A., Razzhigaev, A., Panchenko, A., 2023. On isotropy of multimodal embeddings. Information 14, 392

  33. [33]

    Masked embedding modeling with rapid domain adjustment for few-shot image classification

    Walsh, R., Osman, I.I., Shehata, M.S., 2023. Masked embedding modeling with rapid domain adjustment for few-shot image classification. IEEE Trans. Image Process., 4907–4920

  34. [34]

    Cross-domain few-shot classification via adversarial task augmentation

    Wang, H., Deng, Z.H., 2021. Cross-domain few-shot classification via adversarial task augmentation. arXiv preprint arXiv:2104.14385

  35. [35]

    Wang, X., Peng, Y., Lu, L., Lu, Z., Bagheri, M., Summers, R.M.,

  36. [36]

    Chestx-ray8: Hospital-scale chest x-ray database and benchmarks on weakly-supervised classification and localization of common thorax diseases, in: 2017 IEEE Conference on Computer Vision and Pattern Recognition, CVPR 2017, Honolulu, HI, USA, July 21-26, 2017, IEEE Computer Society. pp. 3462–3471

  37. [37]

    Flair: Vlm with fine-grained language-informed image representations

    Xiao, R., Kim, S., Georgescu, M.I., Akata, Z., Alaniz, S., 2024. Flair: Vlm with fine-grained language-informed image representations. arXiv preprint arXiv:2412.03561

  38. [38]

    Enhancing information maximization with distance-aware contrastive learning for source-free cross-domain few-shot learning

    Xu, H., Liu, L., Zhi, S., Fu, S., Su, Z., Cheng, M., Liu, Y., 2024a. Enhancing information maximization with distance-aware contrastive learning for source-free cross-domain few-shot learning. IEEE Trans. Image Process., 2058–2073

  39. [39]

    Step-wise distribution alignment guided style prompt tuning for source-free cross-domain few-shot learning

    Xu, H., Liu, Y., Liu, L., Zhi, S., Sun, S., Liu, T., Cheng, M.M., 2024b. Step-wise distribution alignment guided style prompt tuning for source-free cross-domain few-shot learning. arXiv preprint arXiv:2411.10070. URL: https://arxiv.org/abs/2411.10070

  40. [40]

    Mma: Multi-modal adapter for vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)

    Yang, L., Zhang, R.Y., Wang, Y., Xie, X., 2024. Mma: Multi-modal adapter for vision-language models, in: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 23826–23837

  41. [41]

    Yang, Y., Deng, J., Li, W., Duan, L., 2025. Resclip: Residual attention for training-free dense vision-language inference, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2025, Nashville, TN, USA, June 11-15, 2025, Computer Vision Foundation / IEEE. pp. 29968–29978

  42. [42]

    Yazdanpanah, M., Moradi, P., 2022. Visual domain bridge: A source-free domain adaptation for cross-domain few-shot learning, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops, CVPR Workshops 2022, New Orleans, LA, USA, June 19-20, 2022, IEEE. pp. 2867–2876. URL: https://doi.org/10.1109/CVPRW56347.2022.00324, doi:10.1109/CVPRW56347.2022.00324

  43. [43]

    Zanella, M., Ayed, I.B., 2024. Low-rank few-shot adaptation of vision-language models, in: IEEE/CVF Conference on Computer Vision and Pattern Recognition, CVPR 2024 - Workshops, Seattle, WA, USA, June 17-18, 2024, IEEE. pp. 1593–1603

  44. [44]

    Sigmoid loss for language image pre-training, in: Proceedings of the IEEE/CVF international conference on computer vision

    Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L., 2023. Sigmoid loss for language image pre-training, in: Proceedings of the IEEE/CVF international conference on computer vision, pp. 11975–11986

  45. [45]

    Tip-adapter: Training-free clip-adapter for better vision-language modeling

    Zhang, R., Fang, R., Zhang, W., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H., 2021. Tip-adapter: Training-free clip-adapter for better vision-language modeling. arXiv preprint arXiv:2111.03930

  46. [46]

    Tip-adapter: Training-free adaption of CLIP for few-shot classification, in: Computer Vision - ECCV 2022 - 17th European Conference

    Zhang, R., Zhang, W., Fang, R., Gao, P., Li, K., Dai, J., Qiao, Y., Li, H., 2022. Tip-adapter: Training-free adaption of CLIP for few-shot classification, in: Computer Vision - ECCV 2022 - 17th European Conference, pp. 493–510

  47. [47]

    Revisiting prototypical network for cross domain few-shot learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition

    Zhou, F., Wang, P., Zhang, L., Wei, W., Zhang, Y., 2023. Revisiting prototypical network for cross domain few-shot learning, in: Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 20061–20070

  48. [48]

    Learning to prompt for vision-language models

    Zhou, K., Yang, J., Loy, C.C., Liu, Z., 2022. Learning to prompt for vision-language models. International Journal of Computer Vision 130, 2337–2348

  49. [49]

    Prompt as free lunch: Enhancing diversity in source-free cross-domain few-shot learning through semantic-guided prompting

    Zhuo, L., Wang, Z., Fu, Y., Qian, T., 2024. Prompt as free lunch: Enhancing diversity in source-free cross-domain few-shot learning through semantic-guided prompting. arXiv preprint arXiv:2412.00767. URL:https://arxiv.org/abs/2412.00767

  50. [50]

    Flatten long-range loss landscapes for cross-domain few-shot learning, in: CVPR 2024, IEEE

    Zou, Y., Liu, Y., Hu, Y., Li, Y., Li, R., 2024a. Flatten long-range loss landscapes for cross-domain few-shot learning, in: CVPR 2024, IEEE. pp. 23575–23584

  51. [51]

    Attention temperature matters in vit-based cross-domain few-shot learning

    Zou, Y., Ma, R., Li, Y., Li, R., 2024b. Attention temperature matters in vit-based cross-domain few-shot learning. Advances in Neural Information Processing Systems 37, 116332–116354

  52. [52]

    A closer look at the CLS token for cross-domain few-shot learning, in: NeurIPS 2024

    Zou, Y., Yi, S., Li, Y., Li, R., 2024c. A closer look at the CLS token for cross-domain few-shot learning, in: NeurIPS 2024
