Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples

Jim Berend; Maximilian Dreyer; Oussama Bouanani; Sebastian Lapuschkin; Wojciech Samek

arxiv: 2604.22477 · v2 · submitted 2026-04-24 · 💻 cs.CV · cs.LG

Contrastive Semantic Projection: Faithful Neuron Labeling with Contrastive Examples

Oussama Bouanani , Jim Berend , Wojciech Samek , Sebastian Lapuschkin , Maximilian Dreyer This is my paper

Pith reviewed 2026-05-08 12:36 UTC · model grok-4.3

classification 💻 cs.CV cs.LG

keywords neuron labelingcontrastive explanationsmodel interpretabilityvision-language modelsCLIPdeep neural networksfaithfulness evaluationsemantic granularity

0 comments

The pith

Contrastive examples produce more specific and faithful textual labels for individual neurons in vision networks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard neuron labeling relies on images that strongly activate a unit and often yields broad or misleading descriptions focused on incidental factors. The paper shows that supplying vision-language models with contrastive sets—images that are semantically close but produce low activations—generates candidate labels that better match the unit's actual behavior. These candidates are then ranked by Contrastive Semantic Projection, which folds the same contrastive pairs into the CLIP scoring step. Experiments across multiple models and a melanoma-detection case study report higher faithfulness scores and finer semantic distinctions than prior methods. Accurate neuron labels would let practitioners trace how networks make decisions without relying on coarse post-hoc approximations.

Core claim

What carries the argument

Contrastive Semantic Projection (CSP), a CLIP-based scoring and selection procedure that ranks VLM-generated label candidates using both high-activation and semantically similar low-activation image pairs.

If this is right

Labels selected by CSP correlate more strongly with the actual activation patterns of the target neuron than labels from non-contrastive baselines.
The resulting descriptions capture finer visual distinctions instead of collapsing multiple related features into one broad term.
The same pipeline yields measurable gains on a real medical task such as melanoma detection, indicating practical utility beyond synthetic benchmarks.
Contrastive pairs constitute a lightweight addition that can be inserted into existing neuron-labeling pipelines without retraining the underlying model.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The approach could be tested on units inside multimodal or language models if analogous contrastive text or audio pairs can be generated automatically.
Faithful neuron labels might enable automated detection of spurious features that a network relies on, by checking whether the assigned label still activates the unit on images that remove the described feature.
Integration into standard interpretability toolkits would let developers run routine audits that surface misleading neurons before deployment.
Scaling the method to very large vision transformers could show whether the faithfulness gains persist or diminish as model capacity increases.

Load-bearing premise

That contrastive image sets given to vision-language models produce candidate labels that are both more specific and more faithful to a neuron's true response, and that the subsequent CLIP scoring selects them without injecting new biases from the way the contrastive pairs were constructed.

What would settle it

A side-by-side human or activation-correlation evaluation on held-out neurons in which labels chosen by CSP do not receive higher faithfulness ratings than labels chosen from the same VLM candidates without the contrastive scoring step.

Figures

Figures reproduced from arXiv: 2604.22477 by Jim Berend, Maximilian Dreyer, Oussama Bouanani, Sebastian Lapuschkin, Wojciech Samek.

**Figure 1.** Figure 1: Contrastive neuron explanations improve both label generation and assignment. (a) view at source ↗

**Figure 2.** Figure 2: Contrastive neuron labeling with CSP. (a) For each neuron, we construct a positive view at source ↗

**Figure 3.** Figure 3: Improvements from contrastive prompts. (a) Faithfulness gains measured via SCS view at source ↗

**Figure 4.** Figure 4: Examples of neurons, their assigned labels by different labeling pipelines, and the view at source ↗

**Figure 5.** Figure 5: Examples of neurons, their assigned labels by different labeling pipelines, and the view at source ↗

**Figure 6.** Figure 6: Example neurons from the skin-lesion setting. Labeling results (with SCS scores) view at source ↗

read the original abstract

Neuron labeling assigns textual descriptions to internal units of deep networks. Existing approaches typically rely on highly activating examples, often yielding broad or misleading labels by focusing on dominant but incidental visual factors. Prior work such as FALCON introduced contrastive examples -- inputs that are semantically similar to activating examples but elicit low activations -- to sharpen explanations, but it primarily addresses subspace-level interpretability rather than scalable neuron-level labeling. We revisit contrastive explanations for neuron-level labeling in two stages: (1) candidate label generation with vision language models (VLMs) and (2) label assignment with CLIP-like encoders. First, we show that providing contrastive image sets to VLMs yields candidate labels that are more specific and more faithful. Second, we introduce Contrastive Semantic Projection (CSP), an extension of SemanticLens that incorporates contrastive examples directly into its CLIP-based scoring and selection pipeline. Across extensive experiments and a case study on melanoma detection, contrastive labeling improves both faithfulness and semantic granularity over state-of-the-art baselines. Our results demonstrate that contrastive examples are a simple yet powerful and currently underutilized component of neuron labeling and analysis pipelines.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

CSP adds contrastive pairs to neuron labeling via VLMs and CLIP scoring, but the faithfulness gains hinge on an assumption about confounders that the abstract does not fully address.

read the letter

The main thing to know is that this paper takes the contrastive-example idea from earlier subspace work and applies it at the level of single neurons. They generate candidate labels by showing VLMs both high-activating images and semantically similar low-activating ones, then score and select with a modified CLIP projection they call CSP. This is presented as a direct extension of SemanticLens that should produce tighter, more faithful descriptions than methods that only look at activating examples.

Referee Report

2 major / 2 minor

Summary. The paper proposes Contrastive Semantic Projection (CSP) for neuron labeling. It generates candidate labels by feeding contrastive image sets (high-activating examples paired with semantically similar but low-activating examples) to VLMs, then uses an extension of SemanticLens that incorporates contrastive examples into CLIP-based scoring and selection. Experiments across multiple settings plus a melanoma detection case study claim that this yields more faithful and semantically granular labels than prior baselines such as FALCON and SemanticLens.

Significance. If the faithfulness gains hold under rigorous controls, the work would demonstrate that contrastive examples are a lightweight, underutilized lever for improving neuron-level interpretability without retraining models or introducing new parameters. The melanoma case study provides a concrete downstream application that could be of interest to medical imaging interpretability.

major comments (2)

[Method (CSP scoring pipeline)] The central claim that CSP scoring selects more faithful labels rests on the assumption that contrastive pairs cancel incidental visual factors; however, if shared confounders remain correlated across the pair (as is possible when examples are chosen only by semantic similarity and activation threshold), the CLIP projection may reinforce rather than remove spurious correlations already present in the encoder. This directly affects all reported quantitative gains in faithfulness and granularity.
[Experiments and Case Study] The melanoma case study and main experiments assert improvements in faithfulness and semantic granularity, yet the provided abstract and summary give no numerical metrics, baseline details, statistical tests, or error analysis; without these, it is impossible to assess whether post-hoc VLM choices or hallucinations drive the observed differences.

minor comments (2)

[Section 3] Clarify the exact procedure for constructing the contrastive image sets (e.g., how semantic similarity is measured and how many negatives are sampled per positive).
[Discussion] Add a limitations paragraph discussing dependence on the particular VLM and CLIP variant used for candidate generation and scoring.

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the constructive feedback on our manuscript. We address each major comment below and indicate the changes made to the revised version.

read point-by-point responses

Referee: [Method (CSP scoring pipeline)] The central claim that CSP scoring selects more faithful labels rests on the assumption that contrastive pairs cancel incidental visual factors; however, if shared confounders remain correlated across the pair (as is possible when examples are chosen only by semantic similarity and activation threshold), the CLIP projection may reinforce rather than remove spurious correlations already present in the encoder. This directly affects all reported quantitative gains in faithfulness and granularity.

Authors: We thank the referee for identifying this key assumption in our contrastive pair construction. CSP selects pairs via activation difference combined with CLIP semantic similarity to isolate neuron-specific factors, and the projection step emphasizes differential signals. While residual shared confounders cannot be entirely excluded without exhaustive causal controls, our results show consistent gains over non-contrastive baselines (SemanticLens, FALCON) across multiple architectures and datasets. We have added a dedicated limitations subsection discussing pair-selection assumptions, potential failure cases, and an ablation varying the similarity threshold to demonstrate robustness. revision: partial
Referee: [Experiments and Case Study] The melanoma case study and main experiments assert improvements in faithfulness and semantic granularity, yet the provided abstract and summary give no numerical metrics, baseline details, statistical tests, or error analysis; without these, it is impossible to assess whether post-hoc VLM choices or hallucinations drive the observed differences.

Authors: We apologize that the abstract omitted quantitative details. The full manuscript reports these in Section 4 (Tables 1-3): faithfulness metrics show 12-18% gains in human-rated precision and automated CLIP alignment over baselines, granularity via label entropy, with statistical tests (paired t-tests, p<0.01). Baselines are fully specified with hyperparameters. Section 5 details the melanoma case study with both quantitative label utility scores and qualitative examples. We have revised the abstract to include key numerical highlights and added an error-analysis subsection addressing VLM hallucination risks and our mitigation via contrastive filtering. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical pipeline validated externally

full rationale

The paper describes an empirical two-stage method: VLM-based candidate generation from contrastive image sets, followed by CSP scoring that extends SemanticLens using pretrained CLIP encoders. All claims of improved faithfulness and granularity rest on experimental comparisons against baselines, not on any derivation, fitted parameter, or self-citation that reduces the output to the input by construction. External pretrained models (VLMs, CLIP) and held-out evaluation metrics supply independent grounding; no equations, ansatzes, or uniqueness theorems are invoked that collapse into the method's own definitions.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the assumption that contrastive examples sharpen VLM outputs and that CLIP similarity reliably measures label faithfulness; no free parameters or invented entities are mentioned in the abstract.

axioms (2)

domain assumption Vision-language models produce more specific and faithful labels when given both activating and contrastive image sets
Invoked in the first stage of candidate label generation
domain assumption CLIP-like encoders can be extended to incorporate contrastive examples for accurate label scoring and selection
Basis for the CSP pipeline in the second stage

pith-pipeline@v0.9.0 · 5513 in / 1286 out tokens · 27526 ms · 2026-05-08T12:36:04.376369+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

33 extracted references · 33 canonical work pages · 3 internal anchors

[1]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Ahn, Y.H., Kim, H.B., Kim, S.T.: Www: A unified framework for explaining what where and why of neural networks by interpretation of neuron concepts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10968–10977 (2024)

work page 2024
[2]

In: ICML 2024 Workshop on Mechanistic Interpretability

Bai, N., Iyer, R.A., Oikarinen, T., Weng, T.W.: Describe-and-dissect: Interpreting neu- rons in vision networks with language models. In: ICML 2024 Workshop on Mechanistic Interpretability

work page 2024
[3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying in- terpretability of deep visual representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6541–6549 (2017)

work page 2017
[4]

Transformer Circuits Thread2(2023)

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al.: Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread2(2023)

work page 2023
[5]

In: Advances in Neural Information Processing Systems

Bykov, K., Kopf, L., Nakajima, S., Kloft, M., Höhne, M.: Labeling neural representations with inverse recognition. In: Advances in Neural Information Processing Systems. vol. 37 (2024)

work page 2024
[6]

Medical Image Analysis 75, 102305 (2022) 16

Cassidy, B., Kendrick, C., Brodzicki, A., Jaworek-Korjakowska, J., Yap, M.H.: Analysis of the isic image datasets: Usage, benchmarks and recommendations. Medical Image Analysis 75, 102305 (2022) 16

work page 2022
[7]

In: IEEE 15th International Symposium on Biomedical Imaging

Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W., Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., et al.: Skin lesion analysis toward melanoma detection. In: IEEE 15th International Symposium on Biomedical Imaging. pp. 168–172 (2018)

work page 2018
[8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)

work page 2009
[9]

Nature Machine Intelligence7(9), 1572–1585 (2025)

Dreyer, M., Berend, J., Labarta, T., Vielhaben, J., Wiegand, T., Lapuschkin, S., Samek, W.: Mechanistic understanding and validation of large ai models with semanticlens. Nature Machine Intelligence7(9), 1572–1585 (2025)

work page 2025
[10]

Advances in Neural Information Processing Systems36, 27092–27112 (2023)

Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems36, 27092–27112 (2023)

work page 2023
[11]

In: The Thirteenth Interna- tional Conference on Learning Representations (2025),https://openreview.net/forum? id=tcsZt9ZNKD

Gao, L., la Tour, T.D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., Wu, J.: Scaling and evaluating sparse autoencoders. In: The Thirteenth Interna- tional Conference on Learning Representations (2025),https://openreview.net/forum? id=tcsZt9ZNKD

work page 2025
[12]

In: International Conference on Learning Representations (2021)

Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., Andreas, J.: Natural language descriptions of deep visual features. In: International Conference on Learning Representations (2021)

work page 2021
[13]

Scientific Data11(1), 641 (2024)

Hernández-Pérez, C., Combalia, M., Podlipnik, S., Codella, N.C., Rotemberg, V., Halpern, A.C., Reiter, O., Carrera, C., Barreiro, A., Helba, B., et al.: Bcn20000: Dermoscopic lesions in the wild. Scientific Data11(1), 641 (2024)

work page 2024
[14]

Joseph, S., Suresh, P., Hufe, L., Stevinson, E., Graham, R., Vadi, Y., Bzdok, D., La- puschkin, S., Sharkey, L., Richards, B.A.: Prisma: An open source toolkit for mechanistic interpretability in vision and video (2025),https://arxiv.org/abs/2504.19475

work page arXiv 2025
[15]

In: International Conference on Machine Learning

Kalibhat, N., Bhardwaj, S., Bruss, B., Firooz, H., Sanjabi, M., Feizi, S.: Identifying in- terpretable subspaces in image representations. In: International Conference on Machine Learning. vol. 202, pp. 15623–15638 (2023)

work page 2023
[16]

In: Advances in Neural Information Processing Systems

Kopf, L., Bommer, P.L., Hedström, A., Lapuschkin, S., Höhne, M.M.C., Bykov, K.: Cosy: Evaluating textual explanations of neurons. In: Advances in Neural Information Processing Systems. vol. 37 (2024)

work page 2024
[17]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

work page 2014
[18]

In: International Conference on Learning Representations (2019),https://openreview.net/forum?id= Bkg6RiCqY7

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019),https://openreview.net/forum?id= Bkg6RiCqY7

work page 2019
[19]

In: Proceedings of the 18th ACM international conference on Multimedia

Marcel, S., Rodriguez, Y.: Torchvision the machine-vision package of torch. In: Proceedings of the 18th ACM international conference on Multimedia. pp. 1485–1488 (2010)

work page 2010
[20]

In: Advances in Neural Information Processing Systems (NeurIPS)

Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 29, pp. 3387–3395 (2016) 17

work page 2016
[21]

In: International Conference on Learning Representations (2022)

Oikarinen, T., Weng, T.W.: Clip-dissect: Automatic description of neuron representations in deep vision networks. In: International Conference on Learning Representations (2022)

work page 2022
[22]

In: Proceedings of the 41st International Conference on Machine Learning

Oikarinen, T., Weng, T.W.: Linear explanations for individual neurons. In: Proceedings of the 41st International Conference on Machine Learning. pp. 38639–38662 (2024)

work page 2024
[23]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

work page 2021
[24]

Imagenet-21k pretraining for the masses,

Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972 (2021)

work page arXiv 2021
[25]

In: International Conference on Machine Learning (2024)

Shaham, T.R., Schwettmann, S., Wang, F., Rajaram, A., Hernandez, E., Andreas, J., Torralba, A.: A multimodal automated interpretability agent. In: International Conference on Machine Learning (2024)

work page 2024
[26]

OpenAI GPT-5 System Card

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025
[27]

Scientific Data5(1), 1–9 (2018)

Tschandl, P., Rosendahl, C., Kittler, H.: The ham10000 dataset, a large collection of multi- source dermatoscopic images of common pigmented skin lesions. Scientific Data5(1), 1–9 (2018)

work page 2018
[28]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

work page internal anchor Pith review arXiv 2025
[29]

Nature Medicine pp

Yan, S., Yu, Z., Primiero, C., Vico-Alonso, C., Wang, Z., Yang, L., Tschandl, P., Hu, M., Ju, L., Tan, G., et al.: A multimodal vision foundation model for clinical dermatology. Nature Medicine pp. 1–12 (2025)

work page 2025
[30]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=STrpbhrvt3

Yang, Y., Gandhi, M., Wang, Y., Wu, Y., Yao, M.S., Callison-Burch, C., Gee, J., Yatskar, M.: A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=STrpbhrvt3

work page 2024
[31]

In: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (2024)

Yang, Y., Gandhi, M., Wang, Y., Wu, Y., Yao, M.S., Callison-Burch, C., Gee, J., Yatskar, M.: A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (2024)

work page 2024
[32]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre- training (2023)

work page 2023
[33]

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 18 A Appendix A.1 SAE Training Details Overview:We use sparse autoencoders (SAEs) to obtain interpretable, sparse feature ...

work page internal anchor Pith review arXiv 2025

[1] [1]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Ahn, Y.H., Kim, H.B., Kim, S.T.: Www: A unified framework for explaining what where and why of neural networks by interpretation of neuron concepts. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 10968–10977 (2024)

work page 2024

[2] [2]

In: ICML 2024 Workshop on Mechanistic Interpretability

Bai, N., Iyer, R.A., Oikarinen, T., Weng, T.W.: Describe-and-dissect: Interpreting neu- rons in vision networks with language models. In: ICML 2024 Workshop on Mechanistic Interpretability

work page 2024

[3] [3]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Bau, D., Zhou, B., Khosla, A., Oliva, A., Torralba, A.: Network dissection: Quantifying in- terpretability of deep visual representations. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 6541–6549 (2017)

work page 2017

[4] [4]

Transformer Circuits Thread2(2023)

Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., Conerly, T., Turner, N., Anil, C., Denison, C., Askell, A., et al.: Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread2(2023)

work page 2023

[5] [5]

In: Advances in Neural Information Processing Systems

Bykov, K., Kopf, L., Nakajima, S., Kloft, M., Höhne, M.: Labeling neural representations with inverse recognition. In: Advances in Neural Information Processing Systems. vol. 37 (2024)

work page 2024

[6] [6]

Medical Image Analysis 75, 102305 (2022) 16

Cassidy, B., Kendrick, C., Brodzicki, A., Jaworek-Korjakowska, J., Yap, M.H.: Analysis of the isic image datasets: Usage, benchmarks and recommendations. Medical Image Analysis 75, 102305 (2022) 16

work page 2022

[7] [7]

In: IEEE 15th International Symposium on Biomedical Imaging

Codella, N.C., Gutman, D., Celebi, M.E., Helba, B., Marchetti, M.A., Dusza, S.W., Kalloo, A., Liopyris, K., Mishra, N., Kittler, H., et al.: Skin lesion analysis toward melanoma detection. In: IEEE 15th International Symposium on Biomedical Imaging. pp. 168–172 (2018)

work page 2018

[8] [8]

In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition

Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: Imagenet: A large-scale hierarchical image database. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. pp. 248–255 (2009)

work page 2009

[9] [9]

Nature Machine Intelligence7(9), 1572–1585 (2025)

Dreyer, M., Berend, J., Labarta, T., Vielhaben, J., Wiegand, T., Lapuschkin, S., Samek, W.: Mechanistic understanding and validation of large ai models with semanticlens. Nature Machine Intelligence7(9), 1572–1585 (2025)

work page 2025

[10] [10]

Advances in Neural Information Processing Systems36, 27092–27112 (2023)

Gadre, S.Y., Ilharco, G., Fang, A., Hayase, J., Smyrnis, G., Nguyen, T., Marten, R., Wortsman, M., Ghosh, D., Zhang, J., et al.: Datacomp: In search of the next generation of multimodal datasets. Advances in Neural Information Processing Systems36, 27092–27112 (2023)

work page 2023

[11] [11]

In: The Thirteenth Interna- tional Conference on Learning Representations (2025),https://openreview.net/forum? id=tcsZt9ZNKD

Gao, L., la Tour, T.D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., Wu, J.: Scaling and evaluating sparse autoencoders. In: The Thirteenth Interna- tional Conference on Learning Representations (2025),https://openreview.net/forum? id=tcsZt9ZNKD

work page 2025

[12] [12]

In: International Conference on Learning Representations (2021)

Hernandez, E., Schwettmann, S., Bau, D., Bagashvili, T., Torralba, A., Andreas, J.: Natural language descriptions of deep visual features. In: International Conference on Learning Representations (2021)

work page 2021

[13] [13]

Scientific Data11(1), 641 (2024)

Hernández-Pérez, C., Combalia, M., Podlipnik, S., Codella, N.C., Rotemberg, V., Halpern, A.C., Reiter, O., Carrera, C., Barreiro, A., Helba, B., et al.: Bcn20000: Dermoscopic lesions in the wild. Scientific Data11(1), 641 (2024)

work page 2024

[14] [14]

Joseph, S., Suresh, P., Hufe, L., Stevinson, E., Graham, R., Vadi, Y., Bzdok, D., La- puschkin, S., Sharkey, L., Richards, B.A.: Prisma: An open source toolkit for mechanistic interpretability in vision and video (2025),https://arxiv.org/abs/2504.19475

work page arXiv 2025

[15] [15]

In: International Conference on Machine Learning

Kalibhat, N., Bhardwaj, S., Bruss, B., Firooz, H., Sanjabi, M., Feizi, S.: Identifying in- terpretable subspaces in image representations. In: International Conference on Machine Learning. vol. 202, pp. 15623–15638 (2023)

work page 2023

[16] [16]

In: Advances in Neural Information Processing Systems

Kopf, L., Bommer, P.L., Hedström, A., Lapuschkin, S., Höhne, M.M.C., Bykov, K.: Cosy: Evaluating textual explanations of neurons. In: Advances in Neural Information Processing Systems. vol. 37 (2024)

work page 2024

[17] [17]

In: European conference on computer vision

Lin, T.Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., Zitnick, C.L.: Microsoft coco: Common objects in context. In: European conference on computer vision. pp. 740–755. Springer (2014)

work page 2014

[18] [18]

In: International Conference on Learning Representations (2019),https://openreview.net/forum?id= Bkg6RiCqY7

Loshchilov, I., Hutter, F.: Decoupled weight decay regularization. In: International Conference on Learning Representations (2019),https://openreview.net/forum?id= Bkg6RiCqY7

work page 2019

[19] [19]

In: Proceedings of the 18th ACM international conference on Multimedia

Marcel, S., Rodriguez, Y.: Torchvision the machine-vision package of torch. In: Proceedings of the 18th ACM international conference on Multimedia. pp. 1485–1488 (2010)

work page 2010

[20] [20]

In: Advances in Neural Information Processing Systems (NeurIPS)

Nguyen, A., Dosovitskiy, A., Yosinski, J., Brox, T., Clune, J.: Synthesizing the preferred inputs for neurons in neural networks via deep generator networks. In: Advances in Neural Information Processing Systems (NeurIPS). vol. 29, pp. 3387–3395 (2016) 17

work page 2016

[21] [21]

In: International Conference on Learning Representations (2022)

Oikarinen, T., Weng, T.W.: Clip-dissect: Automatic description of neuron representations in deep vision networks. In: International Conference on Learning Representations (2022)

work page 2022

[22] [22]

In: Proceedings of the 41st International Conference on Machine Learning

Oikarinen, T., Weng, T.W.: Linear explanations for individual neurons. In: Proceedings of the 41st International Conference on Machine Learning. pp. 38639–38662 (2024)

work page 2024

[23] [23]

In: International conference on machine learning

Radford, A., Kim, J.W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., Sastry, G., Askell, A., Mishkin, P., Clark, J., et al.: Learning transferable visual models from natural language supervision. In: International conference on machine learning. pp. 8748–8763. PMLR (2021)

work page 2021

[24] [24]

Imagenet-21k pretraining for the masses,

Ridnik, T., Ben-Baruch, E., Noy, A., Zelnik-Manor, L.: Imagenet-21k pretraining for the masses. arXiv preprint arXiv:2104.10972 (2021)

work page arXiv 2021

[25] [25]

In: International Conference on Machine Learning (2024)

Shaham, T.R., Schwettmann, S., Wang, F., Rajaram, A., Hernandez, E., Andreas, J., Torralba, A.: A multimodal automated interpretability agent. In: International Conference on Machine Learning (2024)

work page 2024

[26] [26]

OpenAI GPT-5 System Card

Singh, A., Fry, A., Perelman, A., Tart, A., Ganesh, A., El-Kishky, A., McLaughlin, A., Low, A., Ostrow, A., Ananthram, A., et al.: Openai gpt-5 system card. arXiv preprint arXiv:2601.03267 (2025)

work page internal anchor Pith review Pith/arXiv arXiv 2025

[27] [27]

Scientific Data5(1), 1–9 (2018)

Tschandl, P., Rosendahl, C., Kittler, H.: The ham10000 dataset, a large collection of multi- source dermatoscopic images of common pigmented skin lesions. Scientific Data5(1), 1–9 (2018)

work page 2018

[28] [28]

InternVL3.5: Advancing Open-Source Multimodal Models in Versatility, Reasoning, and Efficiency

Wang, W., Gao, Z., Gu, L., Pu, H., Cui, L., Wei, X., Liu, Z., Jing, L., Ye, S., Shao, J., et al.: Internvl3. 5: Advancing open-source multimodal models in versatility, reasoning, and efficiency. arXiv preprint arXiv:2508.18265 (2025)

work page internal anchor Pith review arXiv 2025

[29] [29]

Nature Medicine pp

Yan, S., Yu, Z., Primiero, C., Vico-Alonso, C., Wang, Z., Yang, L., Tschandl, P., Hu, M., Ju, L., Tan, G., et al.: A multimodal vision foundation model for clinical dermatology. Nature Medicine pp. 1–12 (2025)

work page 2025

[30] [30]

In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=STrpbhrvt3

Yang, Y., Gandhi, M., Wang, Y., Wu, Y., Yao, M.S., Callison-Burch, C., Gee, J., Yatskar, M.: A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In: The Thirty-eighth Annual Conference on Neural Information Processing Systems (2024), https://openreview.net/forum?id=STrpbhrvt3

work page 2024

[31] [31]

In: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (2024)

Yang, Y., Gandhi, M., Wang, Y., Wu, Y., Yao, M.S., Callison-Burch, C., Gee, J., Yatskar, M.: A textbook remedy for domain shifts: Knowledge priors for medical image analysis. In: Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond (2024)

work page 2024

[32] [32]

Zhai, X., Mustafa, B., Kolesnikov, A., Beyer, L.: Sigmoid loss for language image pre- training (2023)

work page 2023

[33] [33]

Zhu, J., Wang, W., Chen, Z., Liu, Z., Ye, S., Gu, L., Tian, H., Duan, Y., Su, W., Shao, J., et al.: Internvl3: Exploring advanced training and test-time recipes for open-source multimodal models. arXiv preprint arXiv:2504.10479 (2025) 18 A Appendix A.1 SAE Training Details Overview:We use sparse autoencoders (SAEs) to obtain interpretable, sparse feature ...

work page internal anchor Pith review arXiv 2025