MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders

Colas Lepoutre; Enrico Cassano; Marco Grangetto; Riccardo Renzulli

arxiv: 2510.26411 · v2 · pith:OMTDGNMVnew · submitted 2025-10-30 · 💻 cs.AI

Pith reviewed 2026-05-25 07:39 UTC · model grok-4.3

classification 💻 cs.AI

keywords sparse autoencoders · mechanistic interpretability · medical vision · chest radiographs · MedCLIP · monosemanticity · CheXpert dataset

The pith

Sparse autoencoders applied to MedCLIP produce neurons with higher monosemanticity and interpretability than the original features on chest X-ray data.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper applies sparse autoencoders designed for medical data to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. It develops an evaluation approach that uses correlation metrics, entropy measures, and automated neuron labeling through another model to assess how cleanly each neuron corresponds to a single concept. Tests on the CheXpert dataset indicate that the resulting MedSAE neurons are more monosemantic and easier to interpret than the raw MedCLIP representations. This work seeks to increase transparency in high-performing medical AI without altering its core architecture.

Core claim

Applying MedSAEs to MedCLIP's latent space decomposes the representations into neurons that exhibit greater monosemanticity and interpretability than the original MedCLIP features, as quantified by the evaluation framework of correlation metrics, entropy analyses, and MedGemma-driven automated naming on the CheXpert dataset.
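The correlation and entropy metrics named above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the function names `pearson` and `label_entropy`, the choice to clip negative activations to zero, and the toy inputs are all assumptions made here. The idea is that a neuron tracking one finding should correlate strongly with that finding's binary label and should spread its activation mass over few classes (low entropy).

```python
import math

def pearson(xs, ys):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        return 0.0  # constant input: correlation is undefined, report 0 (assumption)
    return cov / math.sqrt(vx * vy)

def label_entropy(activations, labels, num_classes):
    """Shannon entropy (bits) of one neuron's activation mass across classes.

    `labels` holds one integer class index per sample (hypothetical encoding).
    Returns NaN for a neuron that never fires, since no distribution exists.
    """
    mass = [0.0] * num_classes
    for a, c in zip(activations, labels):
        mass[c] += max(a, 0.0)  # clip negatives (assumption)
    total = sum(mass)
    if total == 0:
        return math.nan
    probs = [m / total for m in mass if m > 0]
    return -sum(p * math.log2(p) for p in probs)

# Toy example: one neuron's activations on four images, and a binary label
# column for a single finding (values are made up for illustration).
acts = [0.9, 0.8, 0.0, 0.1]
has_finding = [1, 1, 0, 0]
r = pearson(acts, has_finding)
h = label_entropy(acts, has_finding, num_classes=2)
```

In this toy case the neuron fires almost only on positive images, so `r` is close to 1 and `h` is well below the 1-bit maximum for two classes.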

What carries the argument

MedSAE, a sparse autoencoder trained on MedCLIP latent representations to extract more monosemantic components for medical image analysis.
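For readers unfamiliar with sparse autoencoders, the forward pass can be sketched in a few lines. This is an illustrative sketch only: the page does not state which SAE variant MedSAE uses, so the top-k sparsity rule, the decoder-bias centering, the function names, and the tiny hand-written weights below are all assumptions. In practice the input `x` would be a MedCLIP image embedding and the weights would be learned by minimizing reconstruction error.

```python
def matvec(W, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """One forward pass of a top-k sparse autoencoder (assumed variant).

    W_enc has shape (hidden, input); W_dec has shape (input, hidden).
    Returns the sparse code z and the reconstruction of x.
    """
    centered = [xi - bi for xi, bi in zip(x, b_dec)]
    pre = [p + b for p, b in zip(matvec(W_enc, centered), b_enc)]
    relu = [max(p, 0.0) for p in pre]
    # Keep only the k largest activations; zero out the rest.
    order = sorted(range(len(relu)), key=lambda i: relu[i], reverse=True)
    keep = set(order[:k])
    z = [v if i in keep else 0.0 for i, v in enumerate(relu)]
    recon = [r + b for r, b in zip(matvec(W_dec, z), b_dec)]
    return z, recon

def mse(x, recon):
    """Mean squared reconstruction error, the usual SAE training signal."""
    return sum((a - b) ** 2 for a, b in zip(x, recon)) / len(x)

# Toy example with a 2-d "embedding" and 4 hidden neurons (made-up weights).
x = [1.0, 2.0]
W_enc = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]]
b_enc = [0.0, 0.0, 0.0, 0.0]
W_dec = [[1.0, 0.0, 0.5, -1.0], [0.0, 1.0, 0.5, 0.0]]
b_dec = [0.0, 0.0]
z, recon = sae_forward(x, W_enc, b_enc, W_dec, b_dec, k=1)
err = mse(x, recon)
```

With `k=1` only the single strongest hidden neuron stays active, which is what makes each neuron a candidate for a single-concept reading. Larger `k` lowers reconstruction error at the cost of denser codes.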

If this is right

Individual MedSAE neurons can be inspected to reveal which clinical concepts MedCLIP relies on for its predictions.
The same evaluation framework can be reused on other medical vision-language models to compare their feature interpretability.
More interpretable representations support auditing and debugging of AI systems used in radiology workflows.
The approach offers a method to retain model performance while increasing clinical transparency.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Extending MedSAE training to additional imaging modalities such as CT could test whether the monosemanticity gains generalize beyond radiographs.
If MedSAE features map directly to clinical findings, they could allow targeted interventions to correct model errors in specific disease categories.
Incorporating MedSAE-style sparsity during initial model training might produce inherently more interpretable medical representations from the start.

Load-bearing premise

The proposed metrics of correlation, entropy, and automated naming reliably capture true monosemanticity and interpretability without introducing bias from the evaluation tools themselves.

What would settle it

A controlled study on a held-out medical imaging dataset where human radiologists rate MedSAE neurons as no more interpretable than raw MedCLIP features, or where the automated metrics fail to align with those human ratings.
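One way such an alignment check could be run is a rank correlation between the automated interpretability score of each neuron and a mean radiologist rating. The sketch below is a hypothetical illustration: the function names, the tie-handling rule (average ranks), and the scores are assumptions, and no human ratings exist in the source.

```python
import math

def average_ranks(values):
    """1-based ranks, with tied values sharing the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1
        for t in range(i, j + 1):
            ranks[order[t]] = avg
        i = j + 1
    return ranks

def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    vx = sum((a - mx) ** 2 for a in rx)
    vy = sum((b - my) ** 2 for b in ry)
    if vx == 0 or vy == 0:
        return 0.0  # all values tied: correlation undefined, report 0 (assumption)
    return cov / math.sqrt(vx * vy)

# Made-up scores for four neurons: automated metric vs. mean human rating.
automated = [0.9, 0.7, 0.4, 0.2]
human = [5, 4, 2, 1]
rho = spearman(automated, human)
```

A value of `rho` near 1 would indicate that the automated metrics order neurons the way human raters do; a value near 0 would be the failure case the paragraph above describes.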

Original abstract

Artificial intelligence in healthcare requires models that are accurate and interpretable. We advance mechanistic interpretability in medical vision by applying Medical Sparse Autoencoders (MedSAEs) to the latent space of MedCLIP, a vision-language model trained on chest radiographs and reports. To quantify interpretability, we propose an evaluation framework that combines correlation metrics, entropy analyses, and automated neuron naming via the MedGemma foundation model. Experiments on the CheXpert dataset show that MedSAE neurons achieve higher monosemanticity and interpretability than raw MedCLIP features. Our findings bridge high-performing medical AI and transparency, offering a scalable step toward clinically reliable representations. The source code supporting the findings of this study is available at https://github.com/EIDOSLAB/MedSAE.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

MedSAE applies existing SAE methods to MedCLIP with a MedGemma naming pipeline, but the interpretability gains rest on unvalidated automated scoring.

The paper fits sparse autoencoders to MedCLIP's latent space on chest X-ray data and reports that the resulting neurons show higher monosemanticity than the raw features, measured through correlation, entropy, and names generated by MedGemma. The authors run this on CheXpert and release the code. That combination is the concrete new piece: a domain-specific SAE run plus an automated naming step using another medical VLM. The code release is useful for anyone who wants to replicate or extend the setup. The rest follows standard SAE practice without new math or derivations.

The evaluation is the soft spot. MedGemma names the neurons, yet nothing in the abstract shows calibration against human labels or checks whether the names track actual clinical concepts rather than MedGemma's own biases. If the naming step favors features that happen to be legible to MedGemma, the claimed improvement in interpretability could be partly circular. No quantitative results, error bars, or training details appear in the abstract, so the size of the effect and the robustness of the ablations are impossible to judge from the summary.

The work is empirical and straightforward rather than theoretically novel. It targets researchers already working on mechanistic interpretability for medical vision-language models who need a worked example on chest radiographs. Readers who care about transparency requirements in clinical AI might extract a usable pipeline if the full experiments include human validation and clear ablations. I would send it to peer review so referees can examine the actual numbers, check the MedGemma dependency, and ask for a human baseline on the naming task.

Referee Report

2 major / 1 minor

Summary. The paper applies sparse autoencoders (MedSAEs) to the latent representations of MedCLIP, a vision-language model for chest radiographs, and introduces an evaluation framework combining correlation metrics, entropy analyses, and automated neuron naming via MedGemma. Experiments on CheXpert are claimed to demonstrate that MedSAE neurons exhibit higher monosemanticity and interpretability than raw MedCLIP features, with code released at the provided GitHub link.

Significance. If the central empirical claim holds under a validated evaluation protocol, the work would offer a concrete, scalable route to mechanistic interpretability for medical VLMs, potentially aiding clinical trust and debugging. The combination of SAEs with domain-specific automated naming is a natural extension of existing SAE literature to healthcare, but the absence of quantitative results, error bars, or ablation details in the abstract limits immediate assessment of effect sizes or robustness.

major comments (2)

[Abstract / Evaluation Framework] Abstract and evaluation framework (implied §4): the claim that MedSAE neurons achieve higher monosemanticity rests entirely on correlation metrics, entropy, and MedGemma-generated names, yet no human-annotation baseline, inter-rater agreement, or calibration against ground-truth clinical concepts is reported. This leaves open the possibility that the framework systematically favors features legible to MedGemma rather than objectively monosemantic ones.
[Experiments] Experiments section: the abstract states results on CheXpert but supplies no numerical values, confidence intervals, ablation controls (e.g., SAE sparsity levels, dictionary size), or comparison to alternative interpretability methods. Without these, it is impossible to judge whether the reported improvement is load-bearing or sensitive to post-hoc choices.

minor comments (1)

[Abstract] The abstract mentions 'source code supporting the findings' but does not specify which exact metrics, hyperparameters, or MedGemma prompts are released, hindering reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and indicate planned revisions.

Referee: [Abstract / Evaluation Framework] Abstract and evaluation framework (implied §4): the claim that MedSAE neurons achieve higher monosemanticity rests entirely on correlation metrics, entropy, and MedGemma-generated names, yet no human-annotation baseline, inter-rater agreement, or calibration against ground-truth clinical concepts is reported. This leaves open the possibility that the framework systematically favors features legible to MedGemma rather than objectively monosemantic ones.

Authors: We acknowledge that human validation would provide stronger corroboration of monosemanticity. The framework relies on established quantitative metrics from the SAE literature (correlation with clinical labels and entropy for feature purity) plus automated naming with a medically aligned model (MedGemma) chosen precisely because it matches the domain. A full human-annotation study with inter-rater statistics lies outside the scope of the current work due to resource constraints. In revision we will add an explicit limitations paragraph discussing this point and outlining future calibration against ground-truth clinical concepts.

revision: partial

Referee: [Experiments] Experiments section: the abstract states results on CheXpert but supplies no numerical values, confidence intervals, ablation controls (e.g., SAE sparsity levels, dictionary size), or comparison to alternative interpretability methods. Without these, it is impossible to judge whether the reported improvement is load-bearing or sensitive to post-hoc choices.

Authors: The Experiments section of the full manuscript reports numerical results, ablations over sparsity levels and dictionary sizes, and comparisons against raw features. The abstract, however, is a high-level summary and omits these specifics. We will revise the abstract to include representative quantitative values, mention the ablation controls, and note the comparison baseline so that effect sizes and robustness are immediately visible to readers.

revision: yes

Circularity Check

0 steps flagged

No circularity; empirical comparison uses external MedGemma and independent metrics without reduction to inputs.

The paper applies MedSAEs to MedCLIP latents and reports higher monosemanticity/interpretability on CheXpert via correlation, entropy, and MedGemma naming. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the provided text. The evaluation framework is presented as a proposed method rather than a self-definitional loop, and MedGemma is an external model. This is a standard empirical claim with no load-bearing self-referential steps.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review yields no explicit free parameters, axioms, or invented entities; the evaluation depends on standard metrics and an external foundation model whose reliability is assumed.

pith-pipeline@v0.9.0 · 5669 in / 1079 out tokens · 19260 ms · 2026-05-25T07:39:42.756860+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

Sparse autoencoders on EEG transformers identify three regimes of clinical concept encoding and reveal entanglements such as age-pathology confounding via a new steering selectivity metric.

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

TopK SAEs applied to EEG transformers extract clinical features, enable concept steering, and identify selectively steerable, entangled, and non-encoded regimes with a spectral decoder for physiological interpretation.

Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders
cs.LG 2026-05 unverdicted novelty 6.0

Sparse autoencoders on EEG transformers extract clinical features, identify three steering regimes, expose age-pathology entanglements and wrecking-ball failures, and map interventions to frequency spectra.

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · cited by 1 Pith paper · 1 internal anchor

[1] Deep learning models have achieved remarkable results in various medical tasks [1, 2]
INTRODUCTION The emergence of Artificial Intelligence (AI) in healthcare systems is revolutionizing the way patients are diagnosed, treated, and monitored. Deep learning models have achieved remarkable results in various medical tasks [1, 2]. As the volume of medical data continues to grow, the size and complexity of neural network architectures have ...

[2] Assess monosemanticity

[3] 1: The overall proposed pipeline
Automated Interpretability Activating images Training data MedCLIP Vision Encoder (frozen) MedSAE (trainable) Fig. 1: The overall proposed pipeline. (1) We first train MedSAE from MedCLIP vision encoder and extract corresponding embeddings. (2) Then, we compute their Pearson correlation with one-hot encoded vector labels to identify MedSAE neurons-co...

[4] MedSAE: Dissecting MedCLIP Representations with Sparse Autoencoders
BACKGROUND AND RELATED WORK 2.1. Sparse Autoencoders Mechanistic interpretability aims to decompose deep learning models into human-understandable components by analyzing their model activations. This includes labeling circuits and neurons using feature visualization techniques, as well as quantifying concept ac...

[5] METHODOLOGY We now present our methodology for extracting and evaluating interpretable latent representations using SAEs. As illustrated in Figure 1, our MedSAE pipeline comprises three main stages: (1) training SAEs on MedCLIP embeddings, (2) assessing neuron monosemanticity, and (3) performing automated interpretability and neuron naming. Each ste...

[6] EXPERIMENTS AND RESULTS This section evaluates our previously introduced method, showing that MedSAEs effectively disentangle superposed representations in MedCLIP embeddings, revealing structured and clinically meaningful features. These results support the potential of SAEs as a practical tool for mechanistic interpretability in vision-language mo...

[7] CONCLUSION This work presents a first step toward mechanistic interpretability in medical vision-language models by leveraging sparse autoencoders to extract clinically meaningful and monosemantic features from MedCLIP embeddings. Our evaluation framework, combining correlation metrics and automated neuron naming via MedGEMMA, demonstrates improved inte...

[8] Geert Litjens et al., “A survey on deep learning in medical image analysis,” Medical Image Analysis, vol. 42, pp. 60–88, 2017.

[9] P. Celard et al., “A survey on deep learning applied to medical images: from simple artificial neural networks to generative models,” Neural Computing and Applications, vol. 35, no. 3, pp. 2291–2323, Jan 2023.

[10] Jiayang Wu et al., “Multimodal large language models: A survey,” in 2023 IEEE International Conference on Big Data (BigData), 2023, pp. 2247–2256.

[11] Leonard Bereska and Stratis Gavves, “Mechanistic interpretability for AI safety - a review,” Transactions on Machine Learning Research, 2024, Survey Certification, Expert Certification.

[12] Zifeng Wang et al., “MedCLIP: Contrastive learning from unpaired medical images and text,” in Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang, Eds., Abu Dhabi, United Arab Emirates, Dec. 2022, pp. 3876–3887, Association for Computational Linguistics.

[13] Andrew Sellergren et al., “Medgemma technical report,”

[14] Nelson Elhage et al., “A mathematical framework for transformer circuits,” Transformer Circuits Thread, 2021, https://transformer-circuits.pub/2021/framework/index.html.

[15] Trenton Bricken et al., “Towards monosemanticity: Decomposing language models with dictionary learning,” Transformer Circuits Thread, 2023, https://transformer-circuits.pub/2023/monosemantic-features/index.html.

[16] Maximilian Dreyer et al., “Mechanistic understanding and validation of large AI models with SemanticLens,” Nature Machine Intelligence, pp. 1–14, Aug. 2025, Publisher: Nature Publishing Group.

[17] Ahmed Abdulaal et al., “An x-ray is worth 15 features: Sparse autoencoders for interpretable radiology report generation,” 2024.

[18] Nhat Minh Le et al., “Learning biologically relevant features in a pathology foundation model using sparse autoencoders,” in Advancements In Medical Foundation Models: Explainability, Robustness, Security, and Beyond, 2024.

[19] Usha Bhalla et al., “Interpreting CLIP with sparse linear concept embeddings (spliCE),” in The Thirty-eighth Annual Conference on Neural Information Processing Systems, 2024.

[20] Tom Conerly et al., “Update on how we train saes,” Transformer Circuits Thread, 2024.

[21] Gonçalo Santos Paulo et al., “Automatically interpreting millions of features in large language models,” in Forty-second International Conference on Machine Learning,

[22] Jeremy Irvin et al., “Chexpert: a large chest radiograph dataset with uncertainty labels and expert comparison,” in Proceedings of the Thirty-Third AAAI Conference on Artificial Intelligence and Thirty-First Innovative Applications of Artificial Intelligence Conference and Ninth AAAI Symposium on Educational Advances in Artificial Intelligence. 2019, AAA...

[23] Leo Gao et al., “Scaling and evaluating sparse autoencoders,” in The Thirteenth International Conference on Learning Representations, 2025.
