Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples
Pith reviewed 2026-05-08 12:36 UTC · model grok-4.3
The pith
Contrastive examples produce more specific and faithful textual labels for individual neurons in vision networks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Contrastive Semantic Projection (CSP) extends SemanticLens by incorporating contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines.
What carries the argument
Contrastive Semantic Projection (CSP), a CLIP-based scoring and selection procedure that ranks VLM-generated label candidates using both high-activation and semantically similar low-activation image pairs.
If this is right
- Labels selected by CSP correlate more strongly with the actual activation patterns of the target neuron than labels from non-contrastive baselines.
- The resulting descriptions capture finer visual distinctions instead of collapsing multiple related features into one broad term.
- The same pipeline yields measurable gains on a real medical task such as melanoma detection, indicating practical utility beyond synthetic benchmarks.
- Contrastive pairs constitute a lightweight addition that can be inserted into existing neuron-labeling pipelines without retraining the underlying model.
Where Pith is reading between the lines
- The approach could be tested on units inside multimodal or language models if analogous contrastive text or audio pairs can be generated automatically.
- Faithful neuron labels might enable automated detection of spurious features that a network relies on, by checking whether the assigned label still activates the unit on images that remove the described feature.
- Integration into standard interpretability toolkits would let developers run routine audits that surface misleading neurons before deployment.
- Scaling the method to very large vision transformers could show whether the faithfulness gains persist or diminish as model capacity increases.
Load-bearing premise
That contrastive image sets given to vision-language models produce candidate labels that are both more specific and more faithful to a neuron's true response, and that the subsequent CLIP scoring selects them without injecting new biases from the way the contrastive pairs were constructed.
What would settle it
A side-by-side human or activation-correlation evaluation on held-out neurons in which labels chosen by CSP do not receive higher faithfulness ratings than labels chosen from the same VLM candidates without the contrastive scoring step.
Figures
read the original abstract
Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples -- inputs that are semantically similar to activating examples but elicit low activations -- to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Contrastive Semantic Projection (CSP) for neuron labeling. It generates candidate labels by feeding contrastive image sets (high-activating examples paired with semantically similar but low-activating examples) to VLMs, then uses an extension of SemanticLens that incorporates contrastive examples into CLIP-based scoring and selection. Experiments across multiple settings plus a melanoma detection case study claim that this yields more faithful and semantically granular labels than prior baselines such as FALCON and SemanticLens.
Significance. If the faithfulness gains hold under rigorous controls, the work would demonstrate that contrastive examples are a lightweight, underutilized lever for improving neuron-level interpretability without retraining models or introducing new parameters. The melanoma case study provides a concrete downstream application that could be of interest to medical imaging interpretability.
major comments (2)
- [Method (CSP scoring pipeline)] The central claim that CSP scoring selects more faithful labels rests on the assumption that contrastive pairs cancel incidental visual factors; however, if shared confounders remain correlated across the pair (as is possible when examples are chosen only by semantic similarity and activation threshold), the CLIP projection may reinforce rather than remove spurious correlations already present in the encoder. This directly affects all reported quantitative gains in faithfulness and granularity.
- [Experiments and Case Study] The melanoma case study and main experiments assert improvements in faithfulness and semantic granularity, yet the provided abstract and summary give no numerical metrics, baseline details, statistical tests, or error analysis; without these, it is impossible to assess whether post-hoc VLM choices or hallucinations drive the observed differences.
minor comments (2)
- [Section 3] Clarify the exact procedure for constructing the contrastive image sets (e.g., how semantic similarity is measured and how many negatives are sampled per positive).
- [Discussion] Add a limitations paragraph discussing dependence on the particular VLM and CLIP variant used for candidate generation and scoring.
Simulated Author's Rebuttal
Thank you for the constructive feedback on our manuscript. We address each major comment below and indicate the changes made to the revised version.
read point-by-point responses
-
Referee: [Method (CSP scoring pipeline)] The central claim that CSP scoring selects more faithful labels rests on the assumption that contrastive pairs cancel incidental visual factors; however, if shared confounders remain correlated across the pair (as is possible when examples are chosen only by semantic similarity and activation threshold), the CLIP projection may reinforce rather than remove spurious correlations already present in the encoder. This directly affects all reported quantitative gains in faithfulness and granularity.
Authors: We thank the referee for identifying this key assumption in our contrastive pair construction. CSP selects pairs via activation difference combined with CLIP semantic similarity to isolate neuron-specific factors, and the projection step emphasizes differential signals. While residual shared confounders cannot be entirely excluded without exhaustive causal controls, our results show consistent gains over non-contrastive baselines (SemanticLens, FALCON) across multiple architectures and datasets. We have added a dedicated limitations subsection discussing pair-selection assumptions, potential failure cases, and an ablation varying the similarity threshold to demonstrate robustness. revision: partial
-
Referee: [Experiments and Case Study] The melanoma case study and main experiments assert improvements in faithfulness and semantic granularity, yet the provided abstract and summary give no numerical metrics, baseline details, statistical tests, or error analysis; without these, it is impossible to assess whether post-hoc VLM choices or hallucinations drive the observed differences.
Authors: We apologize that the abstract omitted quantitative details. The full manuscript reports these in Section 4 (Tables 1-3): faithfulness metrics show 12-18% gains in human-rated precision and automated CLIP alignment over baselines, granularity via label entropy, with statistical tests (paired t-tests, p<0.01). Baselines are fully specified with hyperparameters. Section 5 details the melanoma case study with both quantitative label utility scores and qualitative examples. We have revised the abstract to include key numerical highlights and added an error-analysis subsection addressing VLM hallucination risks and our mitigation via contrastive filtering. revision: yes
Circularity Check
No circularity: empirical pipeline validated externally
full rationale
The paper describes an empirical two-stage method: VLM-based candidate generation from contrastive image sets, followed by CSP scoring that extends SemanticLens using pretrained CLIP encoders. All claims of improved faithfulness and granularity rest on experimental comparisons against baselines, not on any derivation, fitted parameter, or self-citation that reduces the output to the input by construction. External pretrained models (VLMs, CLIP) and held-out evaluation metrics supply independent grounding; no equations, ansatzes, or uniqueness theorems are invoked that collapse into the method's own definitions.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption Vision-language models produce more specific and faithful labels when given both activating and contrastive image sets
- domain assumption CLIP-like encoders can be extended to incorporate contrastive examples for accurate label scoring and selection
Reference graph
Works this paper leans on
-
[1]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Ahn, Y.H., Kim, H.B., Kim, S.T.: Www: A unified framework for explaining what where and why of neural networks by interpretation of neuron concepts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10968–10977 (2024)
work page 2024
-
[2]
In: ICML 2024 Workshop on Mechanistic Interpretability
Bai, N., Iyer, R.A., Oikarinen, T., Weng, T.W.: Describe-and-dissect: Interpreting neu- rons in vision networks with language models. In: ICML 2024 Workshop on Mechanistic Interpretability
work page 2024
-
[3]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying in- terpretability of deep visual representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6541–6549 (2017)
work page 2017
-
[4]
Transformer Circuits Thread2(2023)
Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al.: Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread2(2023)
work page 2023
-
[5]
In: Advances in Neural Information Processing Systems
Bykov, K., Kopf, L., Nakajima, S., Kloft, M., Höhne, M.: Labeling neural representations with inverse recognition. In: Advances in Neural Information Processing Systems. vol. 37 (2024)
work page 2024
-
[6]
Medical Image Analysis 75, 102305 (2022) 16
Cassidy, B., Kendrick, C., Brodzicki, A., Jaworek-Korjakowska, J., Yap, M.H.: Analysis of the isic image datasets: Usage, benchmarks and recommendations. Medical Image Analysis 75, 102305 (2022) 16
work page 2022
-
[7]
In: IEEE 15th International Symposium on Biomedical Imaging
Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W., Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., et al.: Skin lesion analysis toward melanoma detection. In: IEEE 15th International Symposium on Biomedical Imaging. pp. 168–172 (2018)
work page 2018
-
[8]
In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition
Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)
work page 2009
-
[9]
Nature Machine Intelligence7(9), 1572–1585 (2025)
Dreyer, M., Berend, J., Labarta, T., Vielhaben, J., Wiegand, T., Lapuschkin, S., Samek, W.: Mechanistic understanding and validation of large ai models with semanticlens. Nature Machine Intelligence7(9), 1572–1585 (2025)
work page 2025
-
[10]
Advances in Neural Information Processing Systems36, 27092–27112 (2023)
Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems36, 27092–27112 (2023)
work page 2023
-
[11]
Gao, L., la Tour, T.D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., Wu, J.: Scaling and evaluating sparse autoencoders. In: The Thirteenth Interna- tional Conference on Learning Representations (2025),https://openreview.net/forum? id=tcsZt9ZNKD
work page 2025
-
[12]
In: International Conference on Learning Representations (2021)
Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., Andreas, J.: Natural language descriptions of deep visual features. In: International Conference on Learning Representations (2021)
work page 2021
-
[13]
Scientific Data11(1), 641 (2024)
Hernández-Pérez, C., Combalia, M., Podlipnik, S., Codella, N.C., Rotemberg, V., Halpern, A.C., Reiter, O., Carrera, C., Barreiro, A., Helba, B., et al.: Bcn20000: Dermoscopic lesions in the wild. Scientific Data11(1), 641 (2024)
work page 2024
- [14]
-
[15]
In: International Conference on Machine Learning
Kalibhat, N., Bhardwaj, S., Bruss, B., Firooz, H., Sanjabi, M., Feizi, S.: Identifying in- terpretable subspaces in image representations. In: International Conference on Machine Learning. vol. 202, pp. 15623–15638 (2023)
work page 2023
-
[16]
In: Advances in Neural Information Processing Systems
Kopf, L., Bommer, P.L., Hedström, A., Lapuschkin, S., Höhne, M.M.C., Bykov, K.: Cosy: Evaluating textual explanations of neurons. In: Advances in Neural Information Processing Systems. vol. 37 (2024)
work page 2024
-
[17]
In: European conference on computer vision
Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)
work page 2014
-
[18]
Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019),https://openreview.net/forum?id= Bkg6RiCqY7
work page 2019
-
[19]
In: Proceedings of the 18th ACM international conference on Multimedia
Marcel, S., Rodriguez, Y.: Torchvision the machine-vision package of torch. In: Proceedings of the 18th ACM international conference on Multimedia. pp. 1485–1488 (2010)
work page 2010
-
[20]
In: Advances in Neural Information Processing Systems (NeurIPS)
Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 29, pp. 3387–3395 (2016) 17
work page 2016
-
[21]
In: International Conference on Learning Representations (2022)
Oikarinen, T., Weng, T.W.: Clip-dissect: Automatic description of neuron representations in deep vision networks. In: International Conference on Learning Representations (2022)
work page 2022
-
[22]
In: Proceedings of the 41st International Conference on Machine Learning
Oikarinen, T., Weng, T.W.: Linear explanations for individual neurons. In: Proceedings of the 41st International Conference on Machine Learning. pp. 38639–38662 (2024)
work page 2024
-
[23]
In: International conference on machine learning
Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)
work page 2021
-
[24]
Imagenet-21k pretraining for the masses,
Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972 (2021)
-
[25]
In: International Conference on Machine Learning (2024)
Shaham, T.R., Schwettmann, S., Wang, F., Rajaram, A., Hernandez, E., Andreas, J., Torralba, A.: A multimodal automated interpretability agent. In: International Conference on Machine Learning (2024)
work page 2024
-
[26]
Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)
work page internal anchor Pith review Pith/arXiv arXiv 2025
-
[27]
Scientific Data5(1), 1–9 (2018)
Tschandl, P., Rosendahl, C., Kittler, H.: The ham10000 dataset, a large collection of multi- source dermatoscopic images of common pigmented skin lesions. Scientific Data5(1), 1–9 (2018)
work page 2018
-
[28]
InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency
Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)
work page internal anchor Pith review arXiv 2025
-
[29]
Yan, S., Yu, Z., Primiero, C., Vico-Alonso, C., Wang, Z., Yang, L., Tschandl, P., Hu, M., Ju, L., Tan, G., et al.: A multimodal vision foundation model for clinical dermatology. Nature Medicine pp. 1–12 (2025)
work page 2025
-
[30]
Yang, Y., Gandhi, M., Wang, Y., Wu, Y., Yao, M.S., Callison-Burch, C., Gee, J., Yatskar, M.: A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=STrpbhrvt3
work page 2024
-
[31]
Yang, Y., Gandhi, M., Wang, Y., Wu, Y., Yao, M.S., Callison-Burch, C., Gee, J., Yatskar, M.: A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (2024)
work page 2024
-
[32]
Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre- training (2023)
work page 2023
-
[33]
Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 18 A Appendix A.1 SAE Training Details Overview:We use sparse autoencoders (SAEs) to obtain interpretable, sparse feature ...
work page internal anchor Pith review arXiv 2025
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.