pith. sign in

arxiv: 2605.22484 · v1 · pith:2ZKNSMS2new · submitted 2026-05-21 · 💻 cs.CV

Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recycling

Pith reviewed 2026-05-22 06:31 UTC · model grok-4.3

classification 💻 cs.CV
keywords vision-language alignmentsemantic prototypesclassification headsweight recyclingzero-shot learningpost-hoc alignmentcross-modal retrieval
0
0 comments X

The pith

Classification heads from pretrained vision models can be recycled as semantic prototypes to enable zero-shot vision-language alignment and data augmentation.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that the final classification layers of standard image classifiers, normally discarded after pretraining, already embed semantic concepts that line up with text descriptions. Treating these weight vectors as fixed semantic anchors lets researchers connect image and text encoders without any new paired training data, and mixing the anchors with real image-text examples strengthens existing lightweight alignment methods. When added to current post-hoc techniques, the approach raises accuracy on cross-modal retrieval, zero-shot classification, and few-shot classification. A sympathetic reader would care because the method turns waste from ordinary vision training into a free source of semantic signal, lowering the data and compute barrier for vision-language work.

Core claim

Repurposing the supervised classification heads of pretrained vision models as semantic prototypes unlocks zero-shot alignment by using the heads directly as semantic anchors and provides a data-augmentation strategy by mixing the prototypes with real image-text pairs; integrating either use with state-of-the-art post-hoc alignment methods consistently raises performance on cross-modal retrieval and on zero- and few-shot classification tasks.

What carries the argument

The weight vectors of the final classification head, reused without retraining as semantic prototypes that serve as fixed anchors or augmentation signals.

If this is right

  • Post-hoc alignment pipelines gain accuracy on image-text retrieval without collecting extra paired examples.
  • Zero-shot and few-shot image classification improve when the prototypes are added to the alignment stage.
  • Training budgets for vision-language models can be reduced by substituting synthetic prototype pairs for some real pairs.
  • The same heads can be reused across multiple downstream alignment techniques without retraining the vision backbone.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could be tested on non-classification vision backbones whose final layers are not supervised, to check whether the semantic alignment is specific to the classification objective.
  • Mixing ratios between real pairs and prototypes might be tuned per task or dataset size, potentially yielding further gains beyond the fixed schedules reported.
  • If the prototypes prove stable across model families, they could serve as a cheap way to inject domain-specific semantics into general-purpose VLMs.

Load-bearing premise

The classification-head weights already contain semantic concepts aligned closely enough with text embeddings that they can be used directly without any learned mapping or calibration step.

What would settle it

A controlled experiment that replaces the recycled head weights with random vectors of the same dimension and shows that the reported gains in retrieval and classification accuracy disappear or reverse.

Figures

Figures reproduced from arXiv: 2605.22484 by David M\'endez, Natalia D\'iaz Rodr\'iguez, Roberto Confalonieri.

Figure 1
Figure 1. Figure 1: Mutual k-NN alignment (mNN ) to text representations for classification head vectors (dashed) and averaged image em￾beddings (solid) as a function of the number of images per class n. Multiple neighbors k ∈ {3, 5, 10} are tested. Representations are computed using the non-aligned BEiT-B/16 (Bao et al., 2021) image encoder and CLIP’s text encoder. Across all tested k and n, classification heads exhibit high… view at source ↗
Figure 2
Figure 2. Figure 2: Approach to leverage classification heads in post-hoc representation alignment. (a) Illustration of the post-hoc alignment setting, in which lightweight functions g and g are learned to map representations from independently trained image and text encoder to an image-text aligned space. (b) We recycle the classification head weights from ImageNet-21K pretraining with their class names to augment image-text… view at source ↗
Figure 3
Figure 3. Figure 3: Zero-shot classification accuracy of BEiT-B/16 (Bao et al., 2021) aligned to text with an MLP. Stacked bars show the progressive accuracy gains: base performance when only classification head weights from ImageNet-1K pretraining are used during the MLP aligner training, additional gain from using ImageNet-21K weights, and further improvement from incorporating one image-caption pair per class, for all nine… view at source ↗
Figure 4
Figure 4. Figure 4: Gains in FLICKR30K retrieval when augmenting image￾text representations Dimgtxt with classification head weights from ImageNet-21K pretraining Dweights. The x-axis shows the size of Dimgtxt. Post-hoc alignment techniques include CSA (Li et al., 2025), Text-to-Concept (Moayeri et al., 2023) and MLP alignment. Results show that augmenting with Dweights provides the largest gains in the low-data regime, with … view at source ↗
Figure 5
Figure 5. Figure 5 [PITH_FULL_IMAGE:figures/full_fig_p008_5.png] view at source ↗
Figure 6
Figure 6. Figure 6: Mean zero-shot accuracy gains for BEiT-B/16 image encoder aligned to text on Flickr30K image-text pairs vs when augmenting alignment data with recycled ImageNet-21K weight representations. Zero-shot accuracy evaluated on: RESISC45, EuroSAT, Flowers102, OxfordPets, Food101, CIFAR-10, CIFAR-100, DTD, and Places365. The shaded region indicates the standard deviation across datasets. advantage may not stem sol… view at source ↗
Figure 7
Figure 7. Figure 7: Distribution of the cosine similarities for averaged image representations and classification weights corresponding to the Ima￾geNet1K classes. W-W indicates the distribution of cosine similarities between weights, Img-Img between average image representations, and W-Img inter-modality cosine similarities. We evaluate our few-shot classification approach on the nine diverse classification tasks across stan… view at source ↗
Figure 8
Figure 8. Figure 8: Change in zero-shot classification (a) and retrieval (b) performance when using basic modality gap mitigation strategies. ImageNet-1K weights representations undergoing two different modality-gap mitigation strategies (Center & Rescale and Lightweight projection) are used as alignment data. The results show that explicitly mitigating the geometric modality gap did not result in better downstream performanc… view at source ↗
read the original abstract

Vision-Language Models (VLMs) excel at tasks like zero-shot classification and cross-modal retrieval by mapping images and text to a shared space, but this requires expensive end-to-end training with massive paired datasets. Current post-hoc alignment methods reduce computational costs by connecting pretrained encoders through lightweight mappings, yet still demand substantial paired data. In this work, we investigate the potential of repurposing the classification heads of pretrained vision models as semantic prototypes. The recycling of these weights, typically discarded after pretraining, unlocks two distinct capabilities: it enables zero-shot alignment by using weights as semantic anchors, and serves as a robust data augmentation strategy by mixing these prototypes with real image-text pairs. We demonstrate that integrating our approach with several state-of-the-art post-hoc alignment techniques consistently boosts accuracy in cross-modal retrieval, zero- and few-shot classification tasks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript proposes repurposing the classification heads of supervised pretrained vision models (typically discarded post-ImageNet training) as semantic prototypes. These weights are recycled to enable zero-shot cross-modal alignment by treating them as semantic anchors and to perform data augmentation by mixing the prototypes with real image-text pairs. The central empirical claim is that integrating this recycling strategy with existing state-of-the-art post-hoc alignment methods yields consistent accuracy improvements on cross-modal retrieval, zero-shot classification, and few-shot classification tasks.

Significance. If the claimed performance gains are reproducible and attributable to genuine semantic alignment rather than generic regularization, the work would provide a low-cost, data-efficient route to strengthen vision-language alignment without end-to-end retraining or massive paired corpora. Reusing already-computed weights could meaningfully reduce the computational barrier for post-hoc VLM adaptation.

major comments (2)
  1. [Abstract and §3] Abstract and §3 (Method): The central claim that classification-head weights already encode concepts sufficiently aligned with a VLM text encoder to function directly as zero-shot anchors rests on an unverified cross-modal correspondence. No cosine-similarity measurements, nearest-neighbor analyses, or calibration ablations between head weights and text embeddings are reported; without such evidence the observed boosts could arise from increased effective batch size or generic regularization rather than prototype semantics.
  2. [§4] §4 (Experiments): The abstract asserts 'consistent boosts' across multiple post-hoc methods, yet the provided description supplies no numerical deltas, baseline tables, dataset sizes, or statistical significance tests. This absence prevents assessment of effect size and reproducibility, which are load-bearing for the claim that weight recycling meaningfully advances alignment.
minor comments (2)
  1. [§3] Clarify the precise mixing ratio and sampling procedure used when prototypes augment real image-text pairs; the current description leaves the augmentation operator underspecified.
  2. Add a short related-work paragraph contrasting the approach with prior prototype-based or weight-reuse techniques in multimodal learning to better situate the novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We have addressed each major comment below and revised the manuscript accordingly to provide stronger evidence for our claims regarding the semantic alignment of recycled classification heads.

read point-by-point responses
  1. Referee: [Abstract and §3] Abstract and §3 (Method): The central claim that classification-head weights already encode concepts sufficiently aligned with a VLM text encoder to function directly as zero-shot anchors rests on an unverified cross-modal correspondence. No cosine-similarity measurements, nearest-neighbor analyses, or calibration ablations between head weights and text embeddings are reported; without such evidence the observed boosts could arise from increased effective batch size or generic regularization rather than prototype semantics.

    Authors: We agree that direct verification of cross-modal correspondence strengthens the central claim. In the revised manuscript we have added a new analysis subsection to §3 that reports cosine similarities between the recycled classification-head weights and text embeddings from the VLM text encoder for matching and non-matching classes. These measurements show systematically higher similarity for semantically corresponding pairs. We have also included nearest-neighbor retrieval examples and a calibration ablation that replaces the real weights with random vectors of identical dimension; the performance gains largely disappear under randomization, indicating that the improvements are not explained by generic regularization or batch-size effects alone. revision: yes

  2. Referee: [§4] §4 (Experiments): The abstract asserts 'consistent boosts' across multiple post-hoc methods, yet the provided description supplies no numerical deltas, baseline tables, dataset sizes, or statistical significance tests. This absence prevents assessment of effect size and reproducibility, which are load-bearing for the claim that weight recycling meaningfully advances alignment.

    Authors: We acknowledge that the original experimental section lacked sufficient quantitative detail. The revised §4 now contains expanded tables that report (i) baseline performance for each post-hoc alignment method, (ii) absolute and relative accuracy deltas when our recycling strategy is added, (iii) the exact number of image-text pairs used on each dataset, and (iv) mean and standard deviation over five independent runs with statistical significance markers. These additions allow direct evaluation of effect size and reproducibility. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical integration without self-referential derivations

full rationale

The paper presents an empirical proposal to repurpose pretrained classification head weights as semantic prototypes for zero-shot alignment and data augmentation, then shows performance gains when combined with existing post-hoc alignment methods on standard benchmarks. No equations, derivations, or parameter fits are described that reduce the claimed capabilities or accuracy boosts to inputs defined by the method itself. The central claims rest on external experimental validation rather than internal consistency or self-citation chains.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the unstated premise that pretrained vision classification heads contain transferable semantic structure usable for text alignment; no free parameters or invented entities are mentioned in the abstract.

axioms (1)
  • domain assumption Pretrained vision models with classification heads exist and their weights encode category semantics.
    Implicit in the proposal to repurpose those weights as prototypes.

pith-pipeline@v0.9.0 · 5679 in / 1173 out tokens · 31596 ms · 2026-05-22T06:31:24.619203+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 3 internal anchors

  1. [1]

    Food-101 – mining discriminative components with random forests

    Bossard, L., Guillaumin, M., and Van Gool, L. Food-101 – mining discriminative components with random forests. In Fleet, D., Pajdla, T., Schiele, B., and Tuytelaars, T. (eds.),Computer Vision – ECCV 2014, pp. 446–461, Cham,

  2. [2]

    Christensen, A., Mancini, M., Koepke, A., Winther, O., and Akata, Z

    doi: 10.1109/JPROC.2017.2675998. Christensen, A., Mancini, M., Koepke, A., Winther, O., and Akata, Z. Image-free classifier injection for zero- shot classification. InProceedings of the IEEE/CVF International Conference on Computer Vision, pp. 19072– 19081,

  3. [3]

    Frome, A., Corrado, G

    doi: 10.1109/CVPR.2014.461. Frome, A., Corrado, G. S., Shlens, J., Bengio, S., Dean, J., Ranzato, M. A., and Mikolov, T. Devise: A deep visual-semantic embedding model. In Burges, C., Bot- tou, L., Welling, M., Ghahramani, Z., and Weinberger, K. (eds.),Advances in Neural Information Processing Systems, volume

  4. [4]

    Hendrycks, D

    doi: 10.1109/JSTARS.2019.2918242. Hendrycks, D. and Gimpel, K. Gaussian error linear units (GELUs),

  5. [5]

    Gaussian Error Linear Units (GELUs)

    URL https://arxiv.org/abs/ 1606.08415. Huh, M., Cheung, B., Wang, T., and Isola, P. Position: The platonic representation hypothesis. InForty-first International Conference on Machine Learning,

  6. [6]

    Li, P.-h., Chinchali, S

    URL https://www.cs.toronto.edu/˜kriz/ learning-features-2009-TR.pdf. Li, P.-h., Chinchali, S. P., and Topcu, U. Csa: Data-efficient mapping of unimodal features to multimodal features. In International Conference on Learning Representations (ICLR),

  7. [7]

    RoBERTa: A Robustly Optimized BERT Pretraining Approach

    Liu, Y ., Ott, M., Goyal, N., Du, J., Joshi, M., Chen, D., Levy, O., Lewis, M., Zettlemoyer, L., and Stoyanov, V . Roberta: A robustly optimized bert pretraining approach. arXiv preprint arXiv:1907.11692,

  8. [8]

    SGDR: Stochastic Gradient Descent with Warm Restarts

    URL https://arxiv.org/abs/1608.03983. Loshchilov, I. and Hutter, F. Decoupled weight decay reg- ularization. InInternational Conference on Learning Representations,

  9. [9]

    Norelli, A., Fumero, M., Maiorca, V ., Moschella, L., Rodola, E., and Locatello, F

    doi: 10.1109/ICVGIP.2008.47. Norelli, A., Fumero, M., Maiorca, V ., Moschella, L., Rodola, E., and Locatello, F. Asif: Coupled data turns unimodal models to multimodal without training.Advances in Neu- ral Information Processing Systems, 36:15303–15319,

  10. [10]

    Plested, J

    doi: 10.1109/CVPR.2012.6248092. Plested, J. and Gedeon, T. Deep transfer learning for image classification: a survey.arXiv preprint arXiv:2205.09904,

  11. [11]

    URL https://doi.org/10

    doi: 10.1145/3707459. URL https://doi.org/10. 1145/3707459. Tschandl, P., Rosendahl, C., and Kittler, H. The ham10000 dataset, a large collection of multi-source dermatoscopic images of common pigmented skin lesions.Scientific data, 5(1):1–9,

  12. [12]

    The inaturalist species classification and detection dataset

    Van Horn, G., Mac Aodha, O., Song, Y ., Cui, Y ., Sun, C., Shepard, A., Adam, H., Perona, P., and Belongie, S. The inaturalist species classification and detection dataset. In 2018 IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 8769–8778. IEEE,

  13. [13]

    Zhou, K., Yang, J., Loy, C

    doi: 10.1109/TPAMI.2017.2723009. Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Conditional prompt learning for vision-language models. InProceed- ings of the IEEE/CVF conference on computer vision and pattern recognition, pp. 16816–16825, 2022a. Zhou, K., Yang, J., Loy, C. C., and Liu, Z. Learning to prompt for vision-language models.International Jour- nal...

  14. [14]

    The key properties for us are1: • (NC1) Variability Collapse:The feature vectors x for all training samples belonging to a class i collapse to their class mean, or centroid,µ i

    describes the geometric structure of the final-layer features and classifier weights during the terminal phase of cross-entropy training (i.e., after achieving near-zero training error). The key properties for us are1: • (NC1) Variability Collapse:The feature vectors x for all training samples belonging to a class i collapse to their class mean, or centro...

  15. [15]

    MS-COCO is a large-scale dataset comprising over 120,000 images with diverse scenes and objects, each paired with five human-annotated captions

    in the appendix. MS-COCO is a large-scale dataset comprising over 120,000 images with diverse scenes and objects, each paired with five human-annotated captions. For both datasets, we follow the standard evaluation protocol using the widely-employed Karpathy splits (Karpathy & Fei-Fei, 2015). Classification.We evaluate our approach for zero- and few-shot ...

  16. [16]

    We use the Places365 validation set (50 images per class) instead of the larger test set (900 images per class) for computational efficiency

    is a scene-centric database designed for scene recognition and understanding. We use the Places365 validation set (50 images per class) instead of the larger test set (900 images per class) for computational efficiency. For zero-shot experiments, we evaluate on all 50 images per class. For few-shot experiments, we split each class into 10 training and 40 ...

  17. [17]

    Regarding the hyperparameters of CSA and text-to-concepts methods, the CSA method (Li et al.,

    We anneal the learning rate following a cosine schedule without restarts (Loshchilov & Hutter, 2017). Regarding the hyperparameters of CSA and text-to-concepts methods, the CSA method (Li et al.,

  18. [18]

    The checkpoint for the text encoder of CLIP ViT-B/32 model is that fromtorch.hub

    library. The checkpoint for the text encoder of CLIP ViT-B/32 model is that fromtorch.hub. F. Cross-modal retrieval Cross-modal retrieval operates on a set of candidate texts and images within a shared latent space. Given a query, the system retrieves the target whose embedding exhibits the highest similarity to the query. In text-to-image retrieval, the ...

  19. [19]

    As an additional experiment, use Flickr30K image-text pairs to align the BEiT-B/16 image encoder to text, and compare zero-shot classification accuracy when using ImageNet-21K weights to augment this alignment data. The results, illustrated in Figure 6, demonstrate that incorporating weight representations alongside image-text pairs consistently enhances ...

  20. [20]

    We compare post-hoc alignment on BEiT-B/16 with CLIP variants

    dataset. We compare post-hoc alignment on BEiT-B/16 with CLIP variants. We observe that all models struggle with this specialized domain, yet our weight-recycling approach performs on par with the CLIP variants. Model Balanced Accuracy (%) Random 14.28 CLIP ViT-B/3218.85 CLIP ViT-L/14@336px 17.72 BEiT-B/16 (Ours) 18.32 Table 10.Zero-shot classification ac...

  21. [21]

    As shown in Table 12, the downstream performance is notably higher when using ImageNet-21K

    library. As shown in Table 12, the downstream performance is notably higher when using ImageNet-21K. This superior performance is likely due to the greater semantic overlap between the diverse ImageNet-21K classes and the classes found in the 21 Supervised Classification Heads as Semantic Prototypes: Unlocking Vision-Language Alignment via Weight Recyclin...