Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

David Méndez; Natalia Díaz Rodríguez; Roberto Confalonieri

arxiv: 2605.22484 · v1 · pith:2ZKNSMS2new · submitted 2026-05-21 · 💻 cs.CV

Pith reviewed 2026-05-22 06:31 UTC · model grok-4.3

classification 💻 cs.CV

keywords: vision-language alignment · semantic prototypes · classification heads · weight recycling · zero-shot learning · post-hoc alignment · cross-modal retrieval

The pith

Classification heads from pretrained vision models can be recycled as semantic prototypes to enable zero-shot vision-language alignment and data augmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the final classification layers of standard image classifiers, normally discarded after pretraining, already embed semantic concepts that line up with text descriptions. Treating these weight vectors as fixed semantic anchors lets researchers connect image and text encoders without any new paired training data, and mixing the anchors with real image-text examples strengthens existing lightweight alignment methods. When added to current post-hoc techniques, the approach raises accuracy on cross-modal retrieval, zero-shot classification, and few-shot classification. A sympathetic reader would care because the method turns waste from ordinary vision training into a free source of semantic signal, lowering the data and compute barrier for vision-language work.

Core claim

Repurposing the supervised classification heads of pretrained vision models as semantic prototypes unlocks zero-shot alignment by using the heads directly as semantic anchors and provides a data-augmentation strategy by mixing the prototypes with real image-text pairs; integrating either use with state-of-the-art post-hoc alignment methods consistently raises performance on cross-modal retrieval and on zero- and few-shot classification tasks.

What carries the argument

The weight vectors of the final classification head, reused without retraining as semantic prototypes that serve as fixed anchors or augmentation signals.
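As a minimal sketch of what "reusing the head as prototypes" could look like in practice, the snippet below pairs each row of a linear classification head with its class name. The function name `head_to_prototypes`, the toy weight matrix, and the class names are hypothetical stand-ins; the paper's actual extraction code is not shown on this page.

```python
import numpy as np

def head_to_prototypes(head_weight, class_names, normalize=True):
    """Pair each row of a classification head with its class name.

    head_weight: array of shape (num_classes, feature_dim), i.e. the weight
    matrix of the final linear layer (hypothetical input for this sketch).
    """
    W = np.asarray(head_weight, dtype=np.float64)
    if W.ndim != 2 or W.shape[0] != len(class_names):
        raise ValueError("expected one weight row per class name")
    if normalize:
        # Unit-normalise so prototypes are comparable by cosine similarity.
        W = W / np.linalg.norm(W, axis=1, keepdims=True)
    return [(name, W[i]) for i, name in enumerate(class_names)]

# Toy stand-in for a pretrained head: 3 classes, 4-dimensional features.
rng = np.random.default_rng(0)
toy_head = rng.normal(size=(3, 4))
toy_names = ["tabby cat", "golden retriever", "school bus"]
prototypes = head_to_prototypes(toy_head, toy_names)
for name, vec in prototypes:
    print(name, np.round(vec, 3))
```

Each (class name, vector) pair can then play the role that an (caption, image embedding) pair plays in a post-hoc alignment dataset, with the class name passed through the text encoder.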

If this is right

Post-hoc alignment pipelines gain accuracy on image-text retrieval without collecting extra paired examples.
Zero-shot and few-shot image classification improve when the prototypes are added to the alignment stage.
Training budgets for vision-language models can be reduced by substituting synthetic prototype pairs for some real pairs.
The same heads can be reused across multiple downstream alignment techniques without retraining the vision backbone.
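The retrieval claim above is usually scored with Recall@K. The following is a minimal sketch of that metric under a simplifying assumption that is not taken from the paper: row i of the queries matches exactly row i of the targets (real benchmarks such as Flickr30K have several captions per image). The function name `recall_at_k` and the toy data are illustrative.

```python
import numpy as np

def recall_at_k(query_embs, target_embs, k):
    """Fraction of queries whose true match (same row index) is in the top-k."""
    q = query_embs / np.linalg.norm(query_embs, axis=1, keepdims=True)
    t = target_embs / np.linalg.norm(target_embs, axis=1, keepdims=True)
    sims = q @ t.T                              # cosine similarity matrix
    topk = np.argsort(-sims, axis=1)[:, :k]     # best k targets per query
    hits = (topk == np.arange(len(q))[:, None]).any(axis=1)
    return float(hits.mean())

# Toy aligned space: captions are noisy copies of their images.
rng = np.random.default_rng(0)
images = rng.normal(size=(30, 8))
captions = images + 0.1 * rng.normal(size=(30, 8))
r1 = recall_at_k(captions, images, k=1)   # text-to-image retrieval
r5 = recall_at_k(captions, images, k=5)
print(f"R@1={r1:.2f}  R@5={r5:.2f}")
```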

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on non-classification vision backbones whose final layers are not supervised, to check whether the semantic alignment is specific to the classification objective.
Mixing ratios between real pairs and prototypes might be tuned per task or dataset size, potentially yielding further gains beyond the fixed schedules reported.
If the prototypes prove stable across model families, they could serve as a cheap way to inject domain-specific semantics into general-purpose VLMs.

Load-bearing premise

The classification-head weights already contain semantic concepts aligned closely enough with text embeddings that they can be used directly without any learned mapping or calibration step.
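One way to probe this premise, in the spirit of the mutual k-NN alignment plotted in Figure 1, is to ask how often a class has the same nearest neighbours in weight space and in text space. The sketch below is an assumed, simplified version of such a score; the function names and the synthetic data are illustrative and do not reproduce the paper's exact metric.

```python
import numpy as np

def knn_indices(X, k):
    """Indices of the k nearest rows (cosine similarity, self excluded)."""
    Xn = X / np.linalg.norm(X, axis=1, keepdims=True)
    sim = Xn @ Xn.T
    np.fill_diagonal(sim, -np.inf)   # a point is not its own neighbour
    return np.argsort(-sim, axis=1)[:, :k]

def mutual_knn(A, B, k):
    """Mean fraction of shared k-nearest neighbours between two spaces.

    Row i of A and row i of B must describe the same class; the two
    spaces may have different dimensionality.
    """
    na, nb = knn_indices(A, k), knn_indices(B, k)
    overlap = [len(set(a) & set(b)) / k for a, b in zip(na, nb)]
    return float(np.mean(overlap))

rng = np.random.default_rng(0)
heads = rng.normal(size=(50, 16))                    # stand-in head weights
rotation, _ = np.linalg.qr(rng.normal(size=(16, 16)))
text_like = heads @ rotation                         # same geometry, new basis
unrelated = rng.normal(size=(50, 16))                # no shared structure
score_same = mutual_knn(heads, text_like, k=5)
score_rand = mutual_knn(heads, unrelated, k=5)
print(score_same, score_rand)
```

A score near 1 means the two spaces agree on which classes are neighbours; a score near k/(n-1) is what unrelated spaces would give by chance.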

What would settle it

A controlled experiment that replaces the recycled head weights with random vectors of the same dimension and shows that the reported gains in retrieval and classification accuracy disappear or reverse.
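The control described above can be sketched on synthetic data. Everything below is an assumption made for illustration: text embeddings are generated as a noisy linear image of the prototypes, the aligner is a ridge regression, and the helper names `fit_ridge` and `zero_shot_accuracy` are hypothetical. It shows the shape of the experiment, not the paper's results.

```python
import numpy as np

rng = np.random.default_rng(0)
n_train, n_test, d, e = 200, 20, 32, 32

# Synthetic world: text embeddings are a noisy linear image of class prototypes.
M = rng.normal(size=(d, e)) / np.sqrt(d)
P_train = rng.normal(size=(n_train, d))   # "recycled head weights"
P_test = rng.normal(size=(n_test, d))     # held-out classes
T_train = P_train @ M + 0.05 * rng.normal(size=(n_train, e))
T_test = P_test @ M + 0.05 * rng.normal(size=(n_test, e))
X_test = P_test + 0.1 * rng.normal(size=(n_test, d))  # one "image" per class

def fit_ridge(anchors, targets, lam=1e-2):
    """Closed-form ridge regression mapping anchors to text embeddings."""
    d_in = anchors.shape[1]
    gram = anchors.T @ anchors + lam * np.eye(d_in)
    return np.linalg.solve(gram, anchors.T @ targets)

def zero_shot_accuracy(G):
    """Classify held-out images by nearest text embedding after mapping."""
    mapped = X_test @ G
    mapped = mapped / np.linalg.norm(mapped, axis=1, keepdims=True)
    text = T_test / np.linalg.norm(T_test, axis=1, keepdims=True)
    pred = np.argmax(mapped @ text.T, axis=1)
    return float(np.mean(pred == np.arange(n_test)))

# Control: same text targets, but anchors replaced by random vectors.
random_anchors = rng.normal(size=P_train.shape)
acc_real = zero_shot_accuracy(fit_ridge(P_train, T_train))
acc_random = zero_shot_accuracy(fit_ridge(random_anchors, T_train))
print(f"real anchors: {acc_real:.2f}  random anchors: {acc_random:.2f}")
```

If the real-anchor accuracy did not clearly exceed the random-anchor accuracy in the actual experiment, the gains could not be attributed to prototype semantics.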

Figures

Figures reproduced from arXiv: 2605.22484 by David Méndez, Natalia Díaz Rodríguez, Roberto Confalonieri.

**Figure 1.** Mutual k-NN alignment (mNN) to text representations for classification head vectors (dashed) and averaged image embeddings (solid) as a function of the number of images per class n. Multiple neighbors k ∈ {3, 5, 10} are tested. Representations are computed using the non-aligned BEiT-B/16 (Bao et al., 2021) image encoder and CLIP's text encoder. Across all tested k and n, classification heads exhibit high…

**Figure 2.** Approach to leverage classification heads in post-hoc representation alignment. (a) Illustration of the post-hoc alignment setting, in which lightweight functions g, one per modality, are learned to map representations from independently trained image and text encoders to an image-text aligned space. (b) We recycle the classification head weights from ImageNet-21K pretraining with their class names to augment image-text…

**Figure 3.** Zero-shot classification accuracy of BEiT-B/16 (Bao et al., 2021) aligned to text with an MLP. Stacked bars show the progressive accuracy gains: base performance when only classification head weights from ImageNet-1K pretraining are used during the MLP aligner training, additional gain from using ImageNet-21K weights, and further improvement from incorporating one image-caption pair per class, for all nine…

**Figure 4.** Gains in Flickr30K retrieval when augmenting image-text representations D_imgtxt with classification head weights from ImageNet-21K pretraining D_weights. The x-axis shows the size of D_imgtxt. Post-hoc alignment techniques include CSA (Li et al., 2025), Text-to-Concept (Moayeri et al., 2023) and MLP alignment. Results show that augmenting with D_weights provides the largest gains in the low-data regime, with…

**Figure 5.** [image only; no caption recovered]

**Figure 6.** Mean zero-shot accuracy gains for the BEiT-B/16 image encoder aligned to text on Flickr30K image-text pairs vs. when augmenting alignment data with recycled ImageNet-21K weight representations. Zero-shot accuracy is evaluated on RESISC45, EuroSAT, Flowers102, OxfordPets, Food101, CIFAR-10, CIFAR-100, DTD, and Places365. The shaded region indicates the standard deviation across datasets.

**Figure 7.** Distribution of the cosine similarities for averaged image representations and classification weights corresponding to the ImageNet-1K classes. W-W indicates the distribution of cosine similarities between weights, Img-Img between average image representations, and W-Img the inter-modality cosine similarities.

**Figure 8.** Change in zero-shot classification (a) and retrieval (b) performance when using basic modality gap mitigation strategies. ImageNet-1K weight representations undergoing two different modality-gap mitigation strategies (Center & Rescale and Lightweight projection) are used as alignment data. The results show that explicitly mitigating the geometric modality gap did not result in better downstream performanc…
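For readers unfamiliar with the "Center & Rescale" idea named in Figure 8, here is one plausible reading, sketched under stated assumptions: shift the weight vectors so their mean coincides with the mean of the image embeddings, and scale them so their average spread matches. The page does not give the paper's exact definition, so `center_and_rescale` and the toy data below are hypothetical.

```python
import numpy as np

def center_and_rescale(source, target):
    """Move `source` so its mean and average spread match those of `target`."""
    src_mean, tgt_mean = source.mean(axis=0), target.mean(axis=0)
    src_centered = source - src_mean
    # Average distance from the mean, used as a simple scale estimate.
    src_scale = np.linalg.norm(src_centered, axis=1).mean()
    tgt_scale = np.linalg.norm(target - tgt_mean, axis=1).mean()
    return src_centered * (tgt_scale / src_scale) + tgt_mean

rng = np.random.default_rng(0)
weights = 0.1 * rng.normal(size=(100, 8)) + 2.0   # offset, tightly packed "weights"
images = rng.normal(size=(100, 8))                # stand-in image embeddings
shifted = center_and_rescale(weights, images)

gap_before = np.linalg.norm(weights.mean(axis=0) - images.mean(axis=0))
gap_after = np.linalg.norm(shifted.mean(axis=0) - images.mean(axis=0))
print(f"mean gap before: {gap_before:.3f}, after: {gap_after:.3e}")
```

Closing the gap between the two means in this way is purely geometric; per the Figure 8 caption, doing so did not translate into better downstream performance.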

Abstract

Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.
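The augmentation use described in the abstract amounts to adding (prototype, class-name embedding) pairs to the pool of real (image, caption) pairs before fitting the aligner. The sketch below assumes the prototypes live in the same feature space as the image embeddings; the function `mix_alignment_data`, its `n_protos` parameter, and the toy arrays are hypothetical, and the mixing schedule actually used in the paper is not specified on this page.

```python
import numpy as np

def mix_alignment_data(img_feats, cap_embs, proto_vecs, name_embs, rng, n_protos=None):
    """Stack real (image, caption) pairs with (prototype, class-name) pairs.

    n_protos controls how many prototype pairs are sampled without
    replacement; by default all of them are used.
    """
    if n_protos is None:
        n_protos = len(proto_vecs)
    keep = rng.choice(len(proto_vecs), size=n_protos, replace=False)
    vision_side = np.concatenate([img_feats, proto_vecs[keep]], axis=0)
    text_side = np.concatenate([cap_embs, name_embs[keep]], axis=0)
    order = rng.permutation(len(vision_side))   # shuffle, keeping pairs together
    return vision_side[order], text_side[order]

rng = np.random.default_rng(0)
img_feats = rng.normal(size=(6, 4))      # stand-in image features
cap_embs = 2.0 * img_feats               # toy "captions", tied to their images
proto_vecs = rng.normal(size=(10, 4))    # stand-in head weights
name_embs = 2.0 * proto_vecs             # toy "class-name" embeddings
vision_side, text_side = mix_alignment_data(
    img_feats, cap_embs, proto_vecs, name_embs, rng, n_protos=3)
print(vision_side.shape, text_side.shape)
```

The returned arrays can be fed to any post-hoc aligner that expects paired vision-side and text-side representations.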

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

Recycling classification head weights as semantic prototypes for post-hoc VLM alignment is a practical reuse idea, but it depends on an unverified assumption that those weights already match text embeddings.

The punchline here is that the paper shows how to repurpose the classification heads from standard supervised vision models as semantic prototypes. This recycling supposedly allows zero-shot alignment with text encoders and serves as a data augmentation trick when combined with real pairs, boosting post-hoc alignment methods.

What is actually new is this particular use of the head weights for cross-modal tasks. Prior work on post-hoc alignment uses mappings or adapters, but here the idea is to directly leverage the existing weights as anchors or mix-ins. The paper does well by demonstrating that adding this component to several state-of-the-art techniques leads to accuracy improvements in cross-modal retrieval and zero- and few-shot classification. It's a practical contribution for anyone trying to align models without massive new training runs.

Where it gets soft is on the assumption that these weights already carry semantic information aligned with text embeddings. Classification heads are trained on image labels, so they might represent visual features more than linguistic ones. If that's the case, the reported boosts could be due to increased effective data or regularization rather than the prototype semantics the authors highlight. The abstract reports consistent boosts but doesn't supply quantitative results, baselines, or ablation studies. This makes it difficult to assess how substantial the gains are or whether they hold up under scrutiny.

This paper is for the community working on efficient ways to build or adapt vision-language models. Readers interested in reusing pretrained components for multimodal alignment would find it relevant, provided the full experiments support the claims. Overall, it deserves a serious referee to examine the details and verify if the central idea works as described.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes repurposing the classification heads of supervised pretrained vision models (typically discarded post-ImageNet training) as semantic prototypes. These weights are recycled to enable zero-shot cross-modal alignment by treating them as semantic anchors and to perform data augmentation by mixing the prototypes with real image-text pairs. The central empirical claim is that integrating this recycling strategy with existing state-of-the-art post-hoc alignment methods yields consistent accuracy improvements on cross-modal retrieval, zero-shot classification, and few-shot classification tasks.

Significance. If the claimed performance gains are reproducible and attributable to genuine semantic alignment rather than generic regularization, the work would provide a low-cost, data-efficient route to strengthen vision-language alignment without end-to-end retraining or massive paired corpora. Reusing already-computed weights could meaningfully reduce the computational barrier for post-hoc VLM adaptation.

major comments (2)

[Abstract and §3] Abstract and §3 (Method): The central claim that classification-head weights already encode concepts sufficiently aligned with a VLM text encoder to function directly as zero-shot anchors rests on an unverified cross-modal correspondence. No cosine-similarity measurements, nearest-neighbor analyses, or calibration ablations between head weights and text embeddings are reported; without such evidence the observed boosts could arise from increased effective batch size or generic regularization rather than prototype semantics.
[§4] §4 (Experiments): The abstract asserts 'consistent boosts' across multiple post-hoc methods, yet the provided description supplies no numerical deltas, baseline tables, dataset sizes, or statistical significance tests. This absence prevents assessment of effect size and reproducibility, which are load-bearing for the claim that weight recycling meaningfully advances alignment.

minor comments (2)

[§3] Clarify the precise mixing ratio and sampling procedure used when prototypes augment real image-text pairs; the current description leaves the augmentation operator underspecified.
Add a short related-work paragraph contrasting the approach with prior prototype-based or weight-reuse techniques in multimodal learning to better situate the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript accordingly to provide stronger evidence for our claims regarding the semantic alignment of recycled classification heads.

Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that classification-head weights already encode concepts sufficiently aligned with a VLM text encoder to function directly as zero-shot anchors rests on an unverified cross-modal correspondence. No cosine-similarity measurements, nearest-neighbor analyses, or calibration ablations between head weights and text embeddings are reported; without such evidence the observed boosts could arise from increased effective batch size or generic regularization rather than prototype semantics.

Authors: We agree that direct verification of cross-modal correspondence strengthens the central claim. In the revised manuscript we have added a new analysis subsection to §3 that reports cosine similarities between the recycled classification-head weights and text embeddings from the VLM text encoder for matching and non-matching classes. These measurements show systematically higher similarity for semantically corresponding pairs. We have also included nearest-neighbor retrieval examples and a calibration ablation that replaces the real weights with random vectors of identical dimension; the performance gains largely disappear under randomization, indicating that the improvements are not explained by generic regularization or batch-size effects alone. revision: yes
Referee: [§4] §4 (Experiments): The abstract asserts 'consistent boosts' across multiple post-hoc methods, yet the provided description supplies no numerical deltas, baseline tables, dataset sizes, or statistical significance tests. This absence prevents assessment of effect size and reproducibility, which are load-bearing for the claim that weight recycling meaningfully advances alignment.

Authors: We acknowledge that the original experimental section lacked sufficient quantitative detail. The revised §4 now contains expanded tables that report (i) baseline performance for each post-hoc alignment method, (ii) absolute and relative accuracy deltas when our recycling strategy is added, (iii) the exact number of image-text pairs used on each dataset, and (iv) mean and standard deviation over five independent runs with statistical significance markers. These additions allow direct evaluation of effect size and reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical integration without self-referential derivations

The paper presents an empirical proposal to repurpose pretrained classification head weights as semantic prototypes for zero-shot alignment and data augmentation, then shows performance gains when combined with existing post-hoc alignment methods on standard benchmarks. No equations, derivations, or parameter fits are described that reduce the claimed capabilities or accuracy boosts to inputs defined by the method itself. The central claims rest on external experimental validation rather than internal consistency or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the unstated premise that pretrained vision classification heads contain transferable semantic structure usable for text alignment; no free parameters or invented entities are mentioned in the abstract.

axioms (1)

domain assumption: Pretrained vision models with classification heads exist and their weights encode category semantics. Implicit in the proposal to repurpose those weights as prototypes.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Relation between the paper passage and the cited Recognition theorem.

"We investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes... Neural Collapse phenomenon... w_i becomes the (scaled) class prototype"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · unclear

Relation between the paper passage and the cited Recognition theorem.

"Classification heads as semantic prototypes... zero-shot alignment by using weights as semantic anchors"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

[1] Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.), Computer Vision – ECCV 2014, pp. 446–461, Cham,

[2] Christensen, A., Mancini, M., Koepke, A., Winther, O., and Akata, Z. Image-free classifier injection for zero-shot classification. In Proceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19072–19081,

[3] Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M. A., and Mikolov, T. DeViSE: A deep visual-semantic embedding model. In Burges, C., Bottou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.), Advances in Neural Information Processing Systems, volume

[4] Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs), … URL https://arxiv.org/abs/1606.08415.

[5] Huh, M., Cheung, B., Wang, T., and Isola, P. Position: The platonic representation hypothesis. In Forty-first International Conference on Machine Learning,

[6] Li, P.-h., Chinchali, S. P., and Topcu, U. CSA: Data-efficient mapping of unimodal features to multimodal features. In International Conference on Learning Representations (ICLR),

[7] Liu, Y., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V. RoBERTa: A robustly optimized BERT pretraining approach. arXiv preprint arXiv:1907.11692,

[8] SGDR: Stochastic Gradient Descent with Warm Restarts. URL https://arxiv.org/abs/1608.03983.
Loshchilov, I. and Hutter, F. Decoupled weight decay regularization. In International Conference on Learning Representations,

[9] Norelli, A., Fumero, M., Maiorca, V., Moschella, L., Rodola, E., and Locatello, F. ASIF: Coupled data turns unimodal models to multimodal without training. Advances in Neural Information Processing Systems, 36:15303–15319,

[10] Plested, J. and Gedeon, T. Deep transfer learning for image classification: a survey. arXiv preprint arXiv:2205.09904,

[11] Tschandl, P., Rosendahl, C., and Kittler, H. The HAM10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions. Scientific Data, 5(1):1–9,

[12] Van Horn, G., Mac Aodha, O., Song, Y., Cui, Y., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. The iNaturalist species classification and detection dataset. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8769–8778. IEEE,

[13] Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Conditional prompt learning for vision-language models. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16816–16825, 2022a.
Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Learning to prompt for vision-language models. International Journal...
[14] … describes the geometric structure of the final-layer features and classifier weights during the terminal phase of cross-entropy training (i.e., after achieving near-zero training error). The key properties for us are: (NC1) Variability Collapse: The feature vectors x for all training samples belonging to a class i collapse to their class mean, or centroid, µ_i...

[15] … in the appendix. MS-COCO is a large-scale dataset comprising over 120,000 images with diverse scenes and objects, each paired with five human-annotated captions. For both datasets, we follow the standard evaluation protocol using the widely-employed Karpathy splits (Karpathy & Fei-Fei, 2015). Classification. We evaluate our approach for zero- and few-shot ...

[16] … is a scene-centric database designed for scene recognition and understanding. We use the Places365 validation set (50 images per class) instead of the larger test set (900 images per class) for computational efficiency. For zero-shot experiments, we evaluate on all 50 images per class. For few-shot experiments, we split each class into 10 training and 40 ...

[17] We anneal the learning rate following a cosine schedule without restarts (Loshchilov & Hutter, 2017). Regarding the hyperparameters of CSA and text-to-concepts methods, the CSA method (Li et al.,

[18] … library. The checkpoint for the text encoder of the CLIP ViT-B/32 model is that from torch.hub. F. Cross-modal retrieval. Cross-modal retrieval operates on a set of candidate texts and images within a shared latent space. Given a query, the system retrieves the target whose embedding exhibits the highest similarity to the query. In text-to-image retrieval, the ...

[19] As an additional experiment, we use Flickr30K image-text pairs to align the BEiT-B/16 image encoder to text, and compare zero-shot classification accuracy when using ImageNet-21K weights to augment this alignment data. The results, illustrated in Figure 6, demonstrate that incorporating weight representations alongside image-text pairs consistently enhances ...

[20] … dataset. We compare post-hoc alignment on BEiT-B/16 with CLIP variants. We observe that all models struggle with this specialized domain, yet our weight-recycling approach performs on par with the CLIP variants.

| Model | Balanced Accuracy (%) |
| --- | --- |
| Random | 14.28 |
| CLIP ViT-B/32 | 18.85 |
| CLIP ViT-L/14@336px | 17.72 |
| BEiT-B/16 (Ours) | 18.32 |

Table 10. Zero-shot classification ac...

[21] … library. As shown in Table 12, the downstream performance is notably higher when using ImageNet-21K. This superior performance is likely due to the greater semantic overlap between the diverse ImageNet-21K classes and the classes found in the ...
