LPT: Less-overfitting Prompt Tuning for Vision-Language Model

Chenhao Ding; Jizhou Han; Qiang Wang; SongLin Dong; Xinyuan Gao; Yihong Gong; Yuhang He; Zhengdong Zhou

arxiv: 2410.10247 · v3 · submitted 2024-10-14 · 💻 cs.CV · cs.AI

Pith reviewed 2026-05-23 19:01 UTC · model grok-4.3

keywords: prompt tuning · vision-language models · overfitting mitigation · CLIP adaptation · feature space preservation · generalization · hierarchical logit

The pith

LPT reduces overfitting in vision-language prompt tuning by preserving the original feature space structure and constraining class outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LPT as a prompt tuning approach for models like CLIP that counters severe overfitting when adapting to downstream tasks. It uses CLIP to filter out fine-grained foreground details so that prompts are guided by basic visual concepts, then adds a Structural Preservation (SP) constraint that keeps the overall feature space aligned with the frozen model and a Hierarchical Logit (HL) constraint that constrains the class information in the output. A sympathetic reader would care because prompt tuning is an efficient way to transfer a model, yet it often loses generalization on novel classes or new domains. If the method works as described, the adapted model could reshape its features for the task while retaining enough plasticity to handle unseen data.
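The filtering step can be pictured with a minimal sketch. It assumes, hypothetically, that every image patch already carries a CLIP similarity score to the class text and that patches scoring above a threshold count as fine-grained foreground. The function name `mask_foreground`, the threshold value, and zeroing as the masking operation are illustrative choices, not details taken from the paper.

```python
def mask_foreground(patches, scores, threshold=0.5):
    """Zero out patches whose (assumed) CLIP similarity score exceeds threshold.

    patches: list of patch feature vectors (lists of floats)
    scores:  one hypothetical similarity score per patch
    """
    if len(patches) != len(scores):
        raise ValueError("need exactly one score per patch")
    masked = []
    for patch, score in zip(patches, scores):
        if score > threshold:
            # Treated as fine-grained foreground: hide it from the prompt.
            masked.append([0.0] * len(patch))
        else:
            # Treated as coarse context: keep it unchanged.
            masked.append(list(patch))
    return masked


# Toy input: three 2-d "patches"; only the middle one scores above 0.5.
patches = [[0.2, 0.4], [0.9, 0.1], [0.5, 0.5]]
scores = [0.1, 0.8, 0.3]
masked = mask_foreground(patches, scores, threshold=0.5)
```

A lower threshold hides more of the image, which is presumably the kind of trade-off a mask-threshold study examines.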

Core claim

LPT claims that guiding prompts via CLIP-filtered basic concepts, combined with a feature-level Structural Preservation constraint that aligns the overall space to the frozen CLIP and an output-level Hierarchical Logit constraint that preserves class information, allows effective reshaping of the feature space during optimization without losing generalization.

What carries the argument

The LPT framework using foreground filtering plus SP constraint (feature-space alignment to frozen CLIP) and HL constraint (output class information preservation) to guide prompt learning.
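One way to read "aligning the feature-space structure" is as matching pairwise similarities between samples rather than matching the features themselves. The sketch below takes that reading as an assumption; the paper's actual SP loss is not given on this page, and the names `cosine_matrix` and `structure_loss` are hypothetical.

```python
import math


def cosine_matrix(features):
    """Pairwise cosine similarities between feature vectors."""
    norms = [math.sqrt(sum(x * x for x in f)) for f in features]
    n = len(features)
    return [
        [
            sum(a * b for a, b in zip(features[i], features[j]))
            / (norms[i] * norms[j])
            for j in range(n)
        ]
        for i in range(n)
    ]


def structure_loss(tuned, frozen):
    """Mean squared gap between the two pairwise-similarity matrices.

    Assumed stand-in for an SP-style constraint, not the paper's formula.
    """
    s_tuned = cosine_matrix(tuned)
    s_frozen = cosine_matrix(frozen)
    n = len(tuned)
    total = sum(
        (s_tuned[i][j] - s_frozen[i][j]) ** 2
        for i in range(n)
        for j in range(n)
    )
    return total / (n * n)


frozen = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
rotated = [[0.0, 1.0], [-1.0, 0.0], [-1.0, 1.0]]   # same geometry, rotated 90 degrees
collapsed = [[1.0, 0.0], [1.0, 0.1], [1.0, 0.05]]  # samples squeezed together

loss_rotated = structure_loss(rotated, frozen)
loss_collapsed = structure_loss(collapsed, frozen)
```

Under this reading the rotated features incur essentially no penalty even though every vector moved, while the collapsed features are penalized. That is one concrete sense in which a structural constraint could leave room for reshaping.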

If this is right

Improved accuracy on base-to-novel generalization benchmarks.
Stronger performance in cross-dataset transfer settings.
Better results under domain shifts compared to prior prompt tuning methods.
The SP and HL constraints can be applied together to complement each other at feature and output levels.
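For the output-level side, a simplified stand-in is a temperature-softened KL divergence between the frozen model's class distribution and the tuned model's. This is ordinary single-level logit distillation, used here only as an assumption; whatever hierarchy the HL constraint actually uses is not described on this page and is not modeled. The names `softmax` and `logit_kl` and the default temperature are illustrative.

```python
import math


def softmax(logits, temperature=1.0):
    """Numerically stable softmax over a list of logits."""
    scaled = [z / temperature for z in logits]
    peak = max(scaled)
    exps = [math.exp(z - peak) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def logit_kl(frozen_logits, tuned_logits, temperature=2.0):
    """KL(frozen || tuned) over temperature-softened class distributions."""
    p = softmax(frozen_logits, temperature)
    q = softmax(tuned_logits, temperature)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))


frozen_logits = [2.0, 1.0, 0.0]
same_ranking_shifted = [3.0, 2.0, 1.0]  # constant offset, same distribution
overconfident = [5.0, 0.0, -5.0]        # much sharper than the frozen output

kl_shifted = logit_kl(frozen_logits, same_ranking_shifted)
kl_sharp = logit_kl(frozen_logits, overconfident)
```

A constant shift of the logits costs nothing, whereas a much sharper distribution is penalized, so a term like this discourages the tuned model from becoming overconfident on the training classes.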

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same constraints might apply to prompt tuning on other vision-language models beyond CLIP.
Shorter prompts could become viable if the constraints already limit overfitting.
The approach could be tested on non-classification tasks such as retrieval or segmentation to check if the constraints remain helpful.

Load-bearing premise

That aligning the feature space structure to the frozen CLIP and constraining class outputs will keep task-relevant information while preventing overfitting.

What would settle it

A direct comparison on base-to-novel or cross-dataset benchmarks where LPT produces lower novel-class accuracy or higher overfitting metrics than standard prompt tuning would falsify the claim.
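Such a comparison needs a summary number. Base-to-novel evaluations in the prompt-tuning literature commonly report the harmonic mean of base and novel accuracy; whether LPT reports exactly this metric is an assumption here. The sketch below computes it together with the base-minus-novel gap as a rough overfitting indicator. The accuracies are placeholders, not results from the paper.

```python
def harmonic_mean(base_acc, novel_acc):
    """Harmonic mean of two accuracies (same units in, same units out)."""
    if base_acc + novel_acc == 0:
        return 0.0
    return 2 * base_acc * novel_acc / (base_acc + novel_acc)


def summarize(name, base_acc, novel_acc):
    """Bundle the numbers a base-to-novel comparison would look at."""
    return {
        "method": name,
        "base": base_acc,
        "novel": novel_acc,
        "gap": base_acc - novel_acc,  # large gap hints at overfitting to base classes
        "H": harmonic_mean(base_acc, novel_acc),
    }


# Placeholder accuracies in percent, purely for illustration.
row = summarize("method_a", 80.0, 60.0)
```

Because the harmonic mean is pulled toward the smaller value, a method that gains base accuracy by sacrificing novel accuracy scores worse than one with a balanced profile.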

Figures

Figures reproduced from arXiv: 2410.10247 by Chenhao Ding, Jizhou Han, Qiang Wang, SongLin Dong, Xinyuan Gao, Yihong Gong, Yuhang He, Zhengdong Zhou.

**Figure 2.** An overview of our method. Subfigure (a) shows the training pipeline of our method, where the text prompts and vision prompts …

**Figure 3.** The effect of mask thresholds and sensitivity analysis of …

Original abstract

Vision-language models (VLMs) have demonstrated exceptional generalization capabilities for downstream tasks. Due to its efficiency, prompt learning has gradually become a more effective and efficient method for transferring VLMs to downstream tasks, surpassing traditional finetuning methods. However, during the transfer process, these models are prone to severe overfitting, leading to a significant decline in generalization ability. To address this issue, we propose a framework named LPT, specifically designed for vision-language models. Specifically, we use CLIP to filter out fine-grained foreground information that may lead to overfitting, thereby guiding the prompts with basic visual concepts. Additionally, to further mitigate overfitting, we have developed a Structural Preservation (SP) constraint at the feature level, which aligns the model's overall feature space structure with the frozen CLIP, endowing the feature space with overall plasticity and enabling effective reshaping of the feature space during optimization. Moreover, we employ Hierarchical Logit (HL) constraint at the output layer to constrain the overall class information in the output, complementing the role of SP at the output end. Extensive experiments across various benchmarks (from base-to-novel, cross-dataset transfer, and domain generalization) demonstrate that our approach significantly improves generalization capability and effectively alleviates overfitting compared to state-of-the-art methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

LPT packages three regularization steps for prompt tuning, but the claim that SP and HL preserve plasticity while cutting overfitting rests on an untested premise.

The paper's core offering is a three-part scheme: CLIP-based foreground filtering to steer prompts toward basic visual concepts, a Structural Preservation constraint that pulls the adapted feature space back toward the frozen CLIP model, and a Hierarchical Logit constraint at the output. The authors position this as a way to reduce the overfitting that commonly appears in prompt tuning for VLMs. The combination itself is new even if the individual pieces draw from prior regularization work.

Referee Report

2 major / 0 minor

Summary. The paper proposes LPT, a prompt-tuning framework for vision-language models (e.g., CLIP) to reduce overfitting during downstream adaptation. It introduces CLIP-based filtering of fine-grained foreground information to guide prompts toward basic visual concepts, a Structural Preservation (SP) constraint that aligns the learned feature-space structure with the frozen original CLIP model, and a Hierarchical Logit (HL) constraint that regularizes class-level output information. The central claim is that these components together improve generalization on base-to-novel, cross-dataset transfer, and domain-generalization benchmarks relative to prior prompt-tuning methods.

Significance. If the empirical gains are robust and the SP/HL constraints demonstrably preserve task-relevant plasticity without over-constraining adaptation, the work would supply a practical regularization recipe for efficient VLM transfer that addresses a well-known limitation of prompt learning. The emphasis on feature-space alignment and output-level constraints is a coherent attempt to regularize at multiple scales.

major comments (2)

[Abstract] The assertion that the SP constraint 'endows the feature space with overall plasticity' and enables effective reshaping is load-bearing for the generalization claim, yet no derivation, ablation, or counter-example is supplied showing that the alignment does not discard downstream-relevant signals or over-constrain the adaptation.
[Abstract] The premise that HL at the output layer complements SP without discarding class-relevant information is invoked to explain reduced overfitting, but the manuscript provides no quantitative evidence (e.g., ablation removing HL or measuring retained task signal) that the combined constraints reliably improve generalization rather than simply limiting capacity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed feedback on the abstract claims. We address each major comment below, clarifying the support provided in the manuscript body while acknowledging where the abstract could be strengthened for precision. We are prepared to revise accordingly.

Referee: [Abstract] The assertion that the SP constraint 'endows the feature space with overall plasticity' and enables effective reshaping is load-bearing for the generalization claim, yet no derivation, ablation, or counter-example is supplied showing that the alignment does not discard downstream-relevant signals or over-constrain the adaptation.

Authors: The abstract summarizes the motivation for SP (Section 3.2), but the supporting evidence appears in the body: Section 4.3 includes ablations removing the SP term, which show degraded base-to-novel and cross-dataset performance, indicating the constraint preserves rather than discards task-relevant plasticity. We also report t-SNE visualizations comparing feature structures with and without SP. However, we agree the abstract phrasing is assertive without an inline pointer to these results; we will revise the abstract to reference the empirical support more explicitly.

revision: yes

Referee: [Abstract] The premise that HL at the output layer complements SP without discarding class-relevant information is invoked to explain reduced overfitting, but the manuscript provides no quantitative evidence (e.g., ablation removing HL or measuring retained task signal) that the combined constraints reliably improve generalization rather than simply limiting capacity.

Authors: Section 4.3 presents an ablation isolating the HL term (and the joint SP+HL configuration), demonstrating that removing HL reduces generalization gains on the reported benchmarks while the full model improves over baselines. We interpret the retained performance lift as evidence that class-relevant signal is not discarded. That said, the abstract does not cite these ablations, so the concern about missing quantitative grounding in the summary is fair. We will update the abstract to note the ablation results supporting complementarity.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical regularization framework

The paper introduces LPT as an empirical prompt-tuning framework using SP (feature-space alignment to frozen CLIP) and HL (logit constraint) to mitigate overfitting. Claims rest on experimental results across base-to-novel, cross-dataset, and domain-generalization benchmarks rather than any closed-form derivation or prediction. No equations reduce a claimed result to a fitted parameter by construction, no self-citation chain justifies a uniqueness theorem, and no ansatz is smuggled via prior work. The method is presented as a regularization technique whose value is assessed externally via benchmarks; the central generalization improvements are not tautological with the input constraints.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The method rests on the empirical claim that the proposed constraints improve generalization; no explicit free parameters, axioms, or invented entities are stated in the abstract, though balancing weights for SP and HL are implicitly required.
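Where those balancing weights would enter can be written out under the assumption that the constraints are added to the usual cross-entropy objective as weighted penalty terms. The additive form and the symbols $\lambda_{\text{SP}}$ and $\lambda_{\text{HL}}$ are illustrative; the page does not state the paper's actual objective.

```latex
\mathcal{L}_{\text{total}}
  = \mathcal{L}_{\text{CE}}
  + \lambda_{\text{SP}}\,\mathcal{L}_{\text{SP}}
  + \lambda_{\text{HL}}\,\mathcal{L}_{\text{HL}},
\qquad \lambda_{\text{SP}},\ \lambda_{\text{HL}} \ge 0
```

Setting both weights to zero would recover unconstrained prompt tuning, so under this form the two weights are tunable quantities even though the ledger counts no free parameters.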

[1] [1]

Bridg- ing the gap between object and image-level representations for open-vocabulary detection

Hanoona Bangalath, Muhammad Maaz, Muhammad Uzair Khattak, Salman H Khan, and Fahad Shahbaz Khan. Bridg- ing the gap between object and image-level representations for open-vocabulary detection. Advances in Neural Informa- tion Processing Systems, 35:33781–33794, 2022. 2

work page 2022

[2] [2]

Food-101–mining discriminative components with random forests

Lukas Bossard, Matthieu Guillaumin, and Luc Van Gool. Food-101–mining discriminative components with random forests. In Computer Vision–ECCV 2014: 13th European Conference, Zurich, Switzerland, September 6-12, 2014, Proceedings, Part VI 13, pages 446–461. Springer, 2014. 5

work page 2014

[3] [3]

Cross-layer distillation with semantic calibration

Defang Chen, Jian-Ping Mei, Yuan Zhang, Can Wang, Zhe Wang, Yan Feng, and Chun Chen. Cross-layer distillation with semantic calibration. In Proceedings of the AAAI con- ference on artificial intelligence, pages 7028–7036, 2021. 3

work page 2021

[4] [4]

Describing textures in the wild

Mircea Cimpoi, Subhransu Maji, Iasonas Kokkinos, Sammy Mohamed, and Andrea Vedaldi. Describing textures in the wild. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages 3606–3613, 2014. 5

work page 2014

[5] [5]

Imagenet: A large-scale hierarchical image database

Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li, and Li Fei-Fei. Imagenet: A large-scale hierarchical image database. In 2009 IEEE conference on computer vision and pattern recognition, pages 248–255. Ieee, 2009. 5

work page 2009

[6] [6]

De- coupling zero-shot semantic segmentation

Jian Ding, Nan Xue, Gui-Song Xia, and Dengxin Dai. De- coupling zero-shot semantic segmentation. InProceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11583–11592, 2022. 1

work page 2022

[7] [7]

Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories

Li Fei-Fei, Rob Fergus, and Pietro Perona. Learning gener- ative visual models from few training examples: An incre- mental bayesian approach tested on 101 object categories. In 2004 conference on computer vision and pattern recognition workshop, pages 178–178. IEEE, 2004. 5

work page 2004

[8] [8]

Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification

Patrick Helber, Benjamin Bischke, Andreas Dengel, and Damian Borth. Eurosat: A novel dataset and deep learning benchmark for land use and land cover classification. IEEE Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 12(7):2217–2226, 2019. 5

work page 2019

[9] [9]

The many faces of robust- ness: A critical analysis of out-of-distribution generalization

Dan Hendrycks, Steven Basart, Norman Mu, Saurav Kada- vath, Frank Wang, Evan Dorundo, Rahul Desai, Tyler Zhu, Samyak Parajuli, Mike Guo, et al. The many faces of robust- ness: A critical analysis of out-of-distribution generalization. In Proceedings of the IEEE/CVF international conference on computer vision, pages 8340–8349, 2021. 5

work page 2021

[10] [10]

Natural adversarial examples

Dan Hendrycks, Kevin Zhao, Steven Basart, Jacob Stein- hardt, and Dawn Song. Natural adversarial examples. In Proceedings of the IEEE/CVF conference on computer vi- sion and pattern recognition, pages 15262–15271, 2021. 5

work page 2021

[11] [11]

Distilling the Knowledge in a Neural Network

Geoffrey Hinton, Oriol Vinyals, and Jeff Dean. Distill- ing the knowledge in a neural network. arXiv preprint arXiv:1503.02531, 2015. 3, 5

work page internal anchor Pith review Pith/arXiv arXiv 2015

[12] [12]

Scaling up visual and vision-language representa- tion learning with noisy text supervision

Chao Jia, Yinfei Yang, Ye Xia, Yi-Ting Chen, Zarana Parekh, Hieu Pham, Quoc Le, Yun-Hsuan Sung, Zhen Li, and Tom Duerig. Scaling up visual and vision-language representa- tion learning with noisy text supervision. In International conference on machine learning, pages 4904–4916. PMLR,

work page

[13] [13]

Multi-level logit distil- lation

Ying Jin, Jiaqi Wang, and Dahua Lin. Multi-level logit distil- lation. In Proceedings of the IEEE/CVF Conference on Com- puter Vision and Pattern Recognition , pages 24276–24285,

work page

[14] [14]

Maple: Multi-modal prompt learning

Muhammad Uzair Khattak, Hanoona Rasheed, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Maple: Multi-modal prompt learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 19113–19122, 2023. 1, 2

work page 2023

[15] [15]

Self-regulating prompts: Foundational model adaptation without forgetting

Muhammad Uzair Khattak, Syed Talal Wasim, Muzam- mal Naseer, Salman Khan, Ming-Hsuan Yang, and Fa- had Shahbaz Khan. Self-regulating prompts: Foundational model adaptation without forgetting. In Proceedings of the IEEE/CVF International Conference on Computer Vision , pages 15190–15200, 2023. 1, 2

work page 2023

[16] [16]

3d object representations for fine-grained categorization

Jonathan Krause, Michael Stark, Jia Deng, and Li Fei-Fei. 3d object representations for fine-grained categorization. In Proceedings of the IEEE international conference on com- puter vision workshops, pages 554–561, 2013. 5

work page 2013

[17]

Language-driven Semantic Segmentation

Boyi Li, Kilian Q Weinberger, Serge Belongie, Vladlen Koltun, and René Ranftl. Language-driven semantic segmentation. arXiv preprint arXiv:2201.03546, 2022. 1

[18]

Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation

Junnan Li, Dongxu Li, Caiming Xiong, and Steven Hoi. Blip: Bootstrapping language-image pre-training for unified vision-language understanding and generation. In International conference on machine learning, pages 12888–12900. PMLR, 2022. 2

[19]

Promptkd: Unsupervised prompt distillation for vision-language models

Zheng Li, Xiang Li, Xinyi Fu, Xin Zhang, Weiqiang Wang, Shuo Chen, and Jian Yang. Promptkd: Unsupervised prompt distillation for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 26617–26626, 2024. 3

[20]

Class-agnostic object detection with multi-modal transformer

Muhammad Maaz, Hanoona Rasheed, Salman Khan, Fahad Shahbaz Khan, Rao Muhammad Anwer, and Ming-Hsuan Yang. Class-agnostic object detection with multi-modal transformer. In European conference on computer vision, pages 512–531. Springer, 2022. 1

[21]

Fine-Grained Visual Classification of Aircraft

Subhransu Maji, Esa Rahtu, Juho Kannala, Matthew Blaschko, and Andrea Vedaldi. Fine-grained visual classification of aircraft. arXiv preprint arXiv:1306.5151, 2013. 5

[22]

Improved knowledge distillation via teacher assistant

Seyed Iman Mirzadeh, Mehrdad Farajtabar, Ang Li, Nir Levine, Akihiro Matsukawa, and Hassan Ghasemzadeh. Improved knowledge distillation via teacher assistant. In Proceedings of the AAAI conference on artificial intelligence, pages 5191–5198, 2020. 3

[23]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008. 5

[24]

Automated flower classification over a large number of classes

Maria-Elena Nilsback and Andrew Zisserman. Automated flower classification over a large number of classes. In 2008 Sixth Indian conference on computer vision, graphics & image processing, pages 722–729. IEEE, 2008. 2

[25]

Cats and dogs

Omkar M Parkhi, Andrea Vedaldi, Andrew Zisserman, and CV Jawahar. Cats and dogs. In 2012 IEEE conference on computer vision and pattern recognition, pages 3498–3505. IEEE, 2012. 5

[26]

Learning deep representations with probabilistic knowledge transfer

Nikolaos Passalis and Anastasios Tefas. Learning deep representations with probabilistic knowledge transfer. In Proceedings of the European Conference on Computer Vision (ECCV), pages 268–284, 2018. 3

[27]

Learning transferable visual models from natural language supervision

Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agarwal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, et al. Learning transferable visual models from natural language supervision. In International conference on machine learning, pages 8748–8763. PMLR, 2021. 1, 2, 3

[28]

Denseclip: Language-guided dense prediction with context-aware prompting

Yongming Rao, Wenliang Zhao, Guangyi Chen, Yansong Tang, Zheng Zhu, Guan Huang, Jie Zhou, and Jiwen Lu. Denseclip: Language-guided dense prediction with context-aware prompting. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 18082–18091, 2022. 2

[29]

Fine-tuned clip models are efficient video learners

Hanoona Rasheed, Muhammad Uzair Khattak, Muhammad Maaz, Salman Khan, and Fahad Shahbaz Khan. Fine-tuned clip models are efficient video learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 6545–6554, 2023. 2, 3, 5

[30]

Do imagenet classifiers generalize to imagenet?

Benjamin Recht, Rebecca Roelofs, Ludwig Schmidt, and Vaishaal Shankar. Do imagenet classifiers generalize to imagenet? In International conference on machine learning, pages 5389–5400. PMLR, 2019. 5

[31]

FitNets: Hints for Thin Deep Nets

Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. Fitnets: Hints for thin deep nets. arXiv preprint arXiv:1412.6550,

[32]

Multi-mode online knowledge distillation for self-supervised visual representation learning

Kaiyou Song, Jin Xie, Shan Zhang, and Zimeng Luo. Multi-mode online knowledge distillation for self-supervised visual representation learning. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 11848–11857, 2023. 3

[33]

A dataset of 101 human action classes from videos in the wild

Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. A dataset of 101 human action classes from videos in the wild. Center for Research in Computer Vision, 2(11):1–7,

[34]

Argue: Attribute-guided prompt tuning for vision-language models

Xinyu Tian, Shu Zou, Zhaoyuan Yang, and Jing Zhang. Argue: Attribute-guided prompt tuning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 28578–28587, 2024. 2

[35]

Learning robust global representations by penalizing local predictive power

Haohan Wang, Songwei Ge, Zachary Lipton, and Eric P Xing. Learning robust global representations by penalizing local predictive power. Advances in Neural Information Processing Systems, 32, 2019. 5

[36]

Learning hierarchical prompt with structured linguistic knowledge for vision-language models

Yubin Wang, Xinyang Jiang, De Cheng, Dongsheng Li, and Cairong Zhao. Learning hierarchical prompt with structured linguistic knowledge for vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 5749–5757, 2024. 2

[37]

Sun database: Large-scale scene recognition from abbey to zoo

Jianxiong Xiao, James Hays, Krista A Ehinger, Aude Oliva, and Antonio Torralba. Sun database: Large-scale scene recognition from abbey to zoo. In 2010 IEEE computer society conference on computer vision and pattern recognition, pages 3485–3492. IEEE, 2010. 5

[38]

Visual-language prompt tuning with knowledge-guided context optimization

Hantao Yao, Rui Zhang, and Changsheng Xu. Visual-language prompt tuning with knowledge-guided context optimization. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 6757–6767,

[39]

Tcp: Textual-based class-aware prompt tuning for visual-language model

Hantao Yao, Rui Zhang, and Changsheng Xu. Tcp: Textual-based class-aware prompt tuning for visual-language model. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 23438–23448, 2024. 1, 2, 6

[40]

Filip: Fine-grained interactive language-image pre-training

Lewei Yao, Runhui Huang, Lu Hou, Guansong Lu, Minzhe Niu, Hang Xu, Xiaodan Liang, Zhenguo Li, Xin Jiang, and Chunjing Xu. Filip: Fine-grained interactive language-image pre-training. arXiv preprint arXiv:2111.07783, 2021. 2

[41]

Open-vocabulary detr with conditional matching

Yuhang Zang, Wei Li, Kaiyang Zhou, Chen Huang, and Chen Change Loy. Open-vocabulary detr with conditional matching. In European Conference on Computer Vision, pages 106–122. Springer, 2022. 2

[42]

Concept-guided prompt learning for generalization in vision-language models

Yi Zhang, Ce Zhang, Ke Yu, Yushun Tang, and Zhihai He. Concept-guided prompt learning for generalization in vision-language models. In Proceedings of the AAAI Conference on Artificial Intelligence, pages 7377–7386, 2024. 2

[43]

Conditional prompt learning for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 16816–16825,

[44]

Learning to prompt for vision-language models

Kaiyang Zhou, Jingkang Yang, Chen Change Loy, and Ziwei Liu. Learning to prompt for vision-language models. International Journal of Computer Vision, 130(9):2337–2348,

[45]

Detecting twenty-thousand classes using image-level supervision

Xingyi Zhou, Rohit Girdhar, Armand Joulin, Philipp Krähenbühl, and Ishan Misra. Detecting twenty-thousand classes using image-level supervision. In European Conference on Computer Vision, pages 350–368. Springer, 2022. 2