MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
Pith reviewed 2026-05-25 07:39 UTC · model grok-4.3
The pith
Sparse autoencoders applied to MedCLIP produce neurons with higher monosemanticity and interpretability than the original features on chest X-ray data.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Applying MedSAEs to MedCLIP's latent space decomposes the representations into neurons that exhibit greater monosemanticity and interpretability than the original MedCLIP features, as quantified by the evaluation framework of correlation metrics, entropy analyses, and MedGemma-driven automated naming on the CheXpert dataset.
What carries the argument
MedSAE, a sparse autoencoder trained on MedCLIP latent representations to extract more monosemantic components for medical image analysis.
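To make the mechanism concrete, here is a minimal sketch of a generic sparse autoencoder forward pass and training loss. It assumes a plain ReLU encoder with an L1 sparsity penalty; the page does not state the exact MedSAE architecture, so the class name `TinySAE`, the initialization, and the `l1_coeff` value are all hypothetical choices for illustration, not the authors' implementation.

```python
import random


def relu(x):
    return x if x > 0.0 else 0.0


class TinySAE:
    """Generic ReLU sparse autoencoder (illustrative; not the paper's exact model)."""

    def __init__(self, d_in, d_hidden, seed=0):
        rng = random.Random(seed)
        # Small random weights; the init scheme is an assumption.
        self.w_enc = [[rng.gauss(0.0, 0.1) for _ in range(d_in)] for _ in range(d_hidden)]
        self.b_enc = [0.0] * d_hidden
        self.w_dec = [[rng.gauss(0.0, 0.1) for _ in range(d_hidden)] for _ in range(d_in)]
        self.b_dec = [0.0] * d_in

    def encode(self, x):
        # Sparse code z = ReLU(W_enc (x - b_dec) + b_enc); every entry is >= 0.
        centered = [xi - bi for xi, bi in zip(x, self.b_dec)]
        return [relu(sum(w * c for w, c in zip(row, centered)) + b)
                for row, b in zip(self.w_enc, self.b_enc)]

    def decode(self, z):
        # Linear reconstruction x_hat = W_dec z + b_dec.
        return [sum(w * zi for w, zi in zip(row, z)) + b
                for row, b in zip(self.w_dec, self.b_dec)]

    def loss(self, x, l1_coeff=1e-3):
        # Reconstruction error plus an L1 penalty that pushes most code entries to zero.
        z = self.encode(x)
        x_hat = self.decode(z)
        mse = sum((a - b) ** 2 for a, b in zip(x, x_hat)) / len(x)
        return mse + l1_coeff * sum(abs(v) for v in z)


# Usage: a 4-dimensional stand-in for a MedCLIP embedding, expanded to 16 neurons.
sae = TinySAE(d_in=4, d_hidden=16)
embedding = [0.1, 0.2, 0.3, 0.4]
code = sae.encode(embedding)
```

The hidden layer is wider than the input so that superposed directions in the embedding can, after training, be assigned to separate neurons; the sparsity penalty is what encourages each neuron to fire for few inputs.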
If this is right
- Individual MedSAE neurons can be inspected to reveal which clinical concepts MedCLIP relies on for its predictions.
- The same evaluation framework can be reused on other medical vision-language models to compare their feature interpretability.
- More interpretable representations support auditing and debugging of AI systems used in radiology workflows.
- The approach offers a method to retain model performance while increasing clinical transparency.
Where Pith is reading between the lines
- Extending MedSAE training to additional imaging modalities such as CT could test whether the monosemanticity gains generalize beyond radiographs.
- If MedSAE features map directly to clinical findings, they could allow targeted interventions to correct model errors in specific disease categories.
- Incorporating MedSAE-style sparsity during initial model training might produce inherently more interpretable medical representations from the start.
Load-bearing premise
The proposed metrics of correlation, entropy, and automated naming reliably capture true monosemanticity and interpretability without introducing bias from the evaluation tools themselves.
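The two quantitative metrics named in this premise can be sketched as follows. This is one plausible reading, not the paper's definition: `pearson` correlates a neuron's activations with a one-hot label vector, and `label_entropy` measures how a neuron's activation mass spreads across classes. The single-label assumption is a simplification (CheXpert images can carry several findings), and both function names are hypothetical.

```python
import math


def pearson(xs, ys):
    """Pearson correlation; returns 0.0 when either input is constant."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)


def label_entropy(activations, labels, num_classes):
    """Shannon entropy (bits) of a neuron's activation mass over class labels.

    Assumes one integer class label per image (a simplification).
    """
    mass = [0.0] * num_classes
    for a, c in zip(activations, labels):
        mass[c] += max(a, 0.0)
    total = sum(mass)
    if total == 0.0:
        return 0.0
    probs = [m / total for m in mass if m > 0.0]
    return sum(-p * math.log2(p) for p in probs)
```

Under this reading, a neuron that fires only on images of one class has entropy 0 bits, while one that fires equally on two classes has entropy 1 bit, so lower entropy together with a high label correlation would indicate a more monosemantic neuron.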
What would settle it
A controlled study on a held-out medical imaging dataset where human radiologists rate MedSAE neurons as no more interpretable than raw MedCLIP features, or where the automated metrics fail to align with those human ratings.
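One way to check whether automated metrics "align with" human ratings is a rank correlation between the two sets of per-neuron scores. The sketch below implements Spearman's rank correlation by hand; choosing Spearman is an assumption made here for illustration, since the page does not say which agreement statistic such a study would use, and the score lists in any real analysis would come from radiologist ratings and metric outputs that are not available here.

```python
import math


def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2.0 + 1.0
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    rx, ry = ranks(xs), ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0.0 or vy == 0.0:
        return 0.0
    return cov / math.sqrt(vx * vy)
```

A value near 1 would mean the automated scores order neurons the same way the human raters do; a value near 0 would be the failure case this section describes.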
Original abstract
Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGemma foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations. The source code supporting the findings of this study is available at https://github.com/EIDOSLAB/MedSAE.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper applies sparse autoencoders (MedSAEs) to the latent representations of MedCLIP, a vision-language model for chest radiographs, and introduces an evaluation framework combining correlation metrics, entropy analyses, and automated neuron naming via MedGemma. Experiments on CheXpert are claimed to demonstrate that MedSAE neurons exhibit higher monosemanticity and interpretability than raw MedCLIP features, with code released at the provided GitHub link.
Significance. If the central empirical claim holds under a validated evaluation protocol, the work would offer a concrete, scalable route to mechanistic interpretability for medical VLMs, potentially aiding clinical trust and debugging. The combination of SAEs with domain-specific automated naming is a natural extension of existing SAE literature to healthcare, but the absence of quantitative results, error bars, or ablation details in the abstract limits immediate assessment of effect sizes or robustness.
Major comments (2)
- [Abstract / Evaluation Framework] Abstract and evaluation framework (implied §4): the claim that MedSAE neurons achieve higher monosemanticity rests entirely on correlation metrics, entropy, and MedGemma-generated names, yet no human-annotation baseline, inter-rater agreement, or calibration against ground-truth clinical concepts is reported. This leaves open the possibility that the framework systematically favors features legible to MedGemma rather than objectively monosemantic ones.
- [Experiments] Experiments section: the abstract states results on CheXpert but supplies no numerical values, confidence intervals, ablation controls (e.g., SAE sparsity levels, dictionary size), or comparison to alternative interpretability methods. Without these, it is impossible to judge whether the reported improvement is load-bearing or sensitive to post-hoc choices.
Minor comments (1)
- [Abstract] The abstract mentions 'source code supporting the findings' but does not specify which exact metrics, hyperparameters, or MedGemma prompts are released, hindering reproducibility.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.
Point-by-point responses
- Referee: [Abstract / Evaluation Framework] Abstract and evaluation framework (implied §4): the claim that MedSAE neurons achieve higher monosemanticity rests entirely on correlation metrics, entropy, and MedGemma-generated names, yet no human-annotation baseline, inter-rater agreement, or calibration against ground-truth clinical concepts is reported. This leaves open the possibility that the framework systematically favors features legible to MedGemma rather than objectively monosemantic ones.
  Authors: We acknowledge that human validation would provide stronger corroboration of monosemanticity. The framework relies on established quantitative metrics from the SAE literature (correlation with clinical labels and entropy for feature purity) plus automated naming with a medically aligned model (MedGemma) chosen precisely because it matches the domain. A full human-annotation study with inter-rater statistics lies outside the scope of the current work due to resource constraints. In revision we will add an explicit limitations paragraph discussing this point and outlining future calibration against ground-truth clinical concepts.
  Revision: partial
- Referee: [Experiments] Experiments section: the abstract states results on CheXpert but supplies no numerical values, confidence intervals, ablation controls (e.g., SAE sparsity levels, dictionary size), or comparison to alternative interpretability methods. Without these, it is impossible to judge whether the reported improvement is load-bearing or sensitive to post-hoc choices.
  Authors: The Experiments section of the full manuscript reports numerical results, ablations over sparsity levels and dictionary sizes, and comparisons against raw features. The abstract, however, is a high-level summary and omits these specifics. We will revise the abstract to include representative quantitative values, mention the ablation controls, and note the comparison baseline so that effect sizes and robustness are immediately visible to readers.
  Revision: yes
Circularity Check
No circularity; empirical comparison uses external MedGemma and independent metrics without reduction to inputs.
Full rationale
The paper applies MedSAEs to MedCLIP latents and reports higher monosemanticity/interpretability on CheXpert via correlation, entropy, and MedGemma naming. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The evaluation framework is presented as a proposed method rather than a self-definitional loop, and MedGemma is an external model. This is a standard empirical claim with no load-bearing self-referential steps.
Forward citations
Cited by 3 Pith papers
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
  Sparse autoencoders on EEG transformers identify three regimes of clinical concept encoding and reveal entanglements such as age-pathology confounding via a new steering selectivity metric.
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
  TopK SAEs applied to EEG transformers extract clinical features, enable concept steering, and identify selectively steerable, entangled, and non-encoded regimes with a spectral decoder for physiological interpretation.
- Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
  Sparse autoencoders on EEG transformers extract clinical features, identify three steering regimes, expose age-pathology entanglements and wrecking-ball failures, and map interventions to frequency spectra.
Reference graph
Works this paper leans on
- [1] Deep learning models have achieved remarkable results in various medical tasks [1, 2]
  INTRODUCTION The emergence of Artificial Intelligence (AI) in healthcare systems is revolutionizing the way patients are diagnosed, treated, and monitored. Deep learning models have achieved remarkable results in various medical tasks [1, 2]. As the volume of medical data continues to grow, the size and complexity of neural network architectures have ...
- [2] Assess monosemanticity
- [3] Fig. 1: The overall proposed pipeline
  Fig. 1: The overall proposed pipeline (diagram labels: Training data, MedCLIP Vision Encoder (frozen), MedSAE (trainable), Activating images, Automated Interpretability). (1) We first train MedSAE from MedCLIP vision encoder and extract corresponding embeddings. (2) Then, we compute their Pearson correlation with one-hot encoded vector labels to identify MedSAE neurons-co...
- [4] MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders (arXiv:2510.26411v1 [cs.AI], 30 Oct 2025)
  BACKGROUND AND RELATED WORK 2.1. Sparse Autoencoders Mechanistic interpretability aims to decompose deep learning models into human-understandable components by analyzing their model activations. This includes labeling circuits and neurons using feature visualization techniques, as well as quantifying concept ac...
- [5] METHODOLOGY We now present our methodology for extracting and evaluating interpretable latent representations using SAEs. As illustrated in Figure 1, our MedSAE pipeline comprises three main stages: (1) training SAEs on MedCLIP embeddings, (2) assessing neuron monosemanticity, and (3) performing automated interpretability and neuron naming. Each ste...
- [6] EXPERIMENTS AND RESULTS This section evaluates our previously introduced method, showing that MedSAEs effectively disentangle superposed representations in MedCLIP embeddings, revealing structured and clinically meaningful features. These results support the potential of SAEs as a practical tool for mechanistic interpretability in vision-language mo...
- [7] CONCLUSION This work presents a first step toward mechanistic interpretability in medical vision-language models by leveraging sparse autoencoders to extract clinically meaningful and monosemantic features from MedCLIP embeddings. Our evaluation framework, combining correlation metrics and automated neuron naming via MedGEMMA, demonstrates improved inte...
- [8] Geert Litjens et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.
- [9] P. Celard et al., “A survey on deep learning applied to medical images: from simple artificial neural networks to generative models,” Neural Computing and Applications, vol. 35, no. 3, pp. 2291–2323, Jan 2023.
- [10] Jiayang Wu et al., “Multimodal large language models: A survey,” in 2023 IEEE International Conference on Big Data (BigData), 2023, pp. 2247–2256.
- [11] Leonard Bereska and Stratis Gavves, “Mechanistic interpretability for AI safety - a review,” Transactions on Machine Learning Research, 2024, Survey Certification, Expert Certification.
- [12] Zifeng Wang et al., “MedCLIP: Contrastive learning from unpaired medical images and text,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, Eds., Abu Dhabi, United Arab Emirates, Dec. 2022, pp. 3876–3887, Association for Computational Linguistics.
- [13]
- [14] Nelson Elhage et al., “A mathematical framework for transformer circuits,” Transformer Circuits Thread, 2021, https://transformer-circuits.pub/2021/framework/index.html.
- [15] Trenton Bricken et al., “Towards monosemanticity: Decomposing language models with dictionary learning,” Transformer Circuits Thread, 2023, https://transformer-circuits.pub/2023/monosemantic-features/index.html.
- [16] Maximilian Dreyer et al., “Mechanistic understanding and validation of large AI models with SemanticLens,” Nature Machine Intelligence, pp. 1–14, Aug. 2025, Publisher: Nature Publishing Group.
- [17] Ahmed Abdulaal et al., “An x-ray is worth 15 features: Sparse autoencoders for interpretable radiology report generation,” 2024.
- [18] Nhat Minh Le et al., “Learning biologically relevant features in a pathology foundation model using sparse autoencoders,” in Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond, 2024.
- [19] Usha Bhalla et al., “Interpreting CLIP with sparse linear concept embeddings (SpLiCE),” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.
- [20] Tom Conerly et al., “Update on how we train SAEs,” Transformer Circuits Thread, 2024.
- [21] Gonçalo Santos Paulo et al., “Automatically interpreting millions of features in large language models,” in Forty-second International Conference on Machine Learning,
- [22] Jeremy Irvin et al., “CheXpert: a large chest radiograph dataset with uncertainty labels and expert comparison,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. 2019, AAA...
- [23] Leo Gao et al., “Scaling and evaluating sparse autoencoders,” in The Thirteenth International Conference on Learning Representations, 2025.