Mechanistic Interpretability of EEG Foundation Models via Sparse Autoencoders

Andreas Brink-Kjær; Anton Mosquera Storgaard; James Zou; Lars Kai Hansen; Magnus Guldberg Pedersen; Magnus Ruud Kjær; Nick Williams; Radu Gatej; Rahul Thapa; Sadasivan Puthusserypady

Sparse autoencoders show that EEG foundation models entangle clinical concepts, for example age with pathology.

Reviewed by Pith at T0; open to challenge. T0 means a machine referee read the full paper against a public rubric.

T0 review · grok-4.3

2026-05-25 06:03 UTC pith:2QGFXL3K

Load-bearing objection: They apply TopK SAEs to three EEG transformers, add a target-vs-off-target steering metric, and map interventions to spectra, but the entanglement claims depend on a narrow clinical taxonomy that may miss confounders.

arXiv 2605.13930 v3 · pith:2QGFXL3K · submitted 2026-05-13 · cs.LG cs.HC cs.NE

William Lehn-Schiøler, Magnus Ruud Kjær, Rahul Thapa, Magnus Guldberg Pedersen, Anton Mosquera Storgaard, Nick Williams, Radu Gatej, Tue Lehn-Schiøler, Andreas Brink-Kjær, Sadasivan Puthusserypady, Sándor Beniczky, James Zou, Lars Kai Hansen

classification: cs.LG cs.HC cs.NE

keywords: EEG foundation models, sparse autoencoders, mechanistic interpretability, concept steering, monosemanticity, clinical entanglement, age-pathology confounding, spectral decoder

verification ladder: T0 review · T1 audit · T2 compute · T3 formal · T4 reserved

The pith

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper uses TopK sparse autoencoders on embeddings from three different EEG foundation models to learn sparse dictionaries of features. These features are then evaluated against a clinical taxonomy of abnormality, age, sex, and medication to measure how cleanly each concept is represented. A single hyperparameter selection method based on dictionary health works across all models. Concept steering experiments identify features that can be changed selectively, those that are entangled, and those not present, while also showing interventions that destroy overall model performance. A decoder translates the feature changes into changes in brain wave frequency spectra.

Core claim

TopK SAEs extract features from EEG transformer embeddings that can be grounded in clinical concepts, revealing three regimes of encoding and exposing failures where concept steering either collapses global performance or entangles concepts like age and pathology such that one cannot be altered without the other.

What carries the argument

TopK Sparse Autoencoders that produce sparse feature dictionaries from model embeddings, paired with a target vs. off-target probe area metric for measuring steering selectivity.
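For readers unfamiliar with the mechanism, here is a minimal NumPy sketch of a TopK sparse autoencoder forward pass. It is an illustration under stated assumptions, not the authors' implementation: the names (`topk_sae_forward`, `W_enc`, `W_dec`, `b_enc`, `b_dec`), the toy sizes, the tied initialisation, and the choice to subtract the decoder bias before encoding are all hypothetical.

```python
import numpy as np

def topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k):
    """Encode x, keep the k largest pre-activations per row, then decode."""
    # Assumed convention: centre the input with the decoder bias before encoding.
    pre = (x - b_dec) @ W_enc + b_enc                # shape (n, m)
    # Column indices of the k largest pre-activations in each row.
    idx = np.argpartition(pre, -k, axis=1)[:, -k:]
    z = np.zeros_like(pre)
    rows = np.arange(pre.shape[0])[:, None]
    # ReLU on the surviving entries; all other features stay at zero.
    z[rows, idx] = np.maximum(pre[rows, idx], 0.0)
    x_hat = z @ W_dec + b_dec                        # reconstruction, shape (n, d)
    return z, x_hat

rng = np.random.default_rng(0)
d, expansion, k, n = 16, 4, 3, 8                     # toy sizes, not the paper's
m = d * expansion                                    # dictionary size
W_enc = rng.normal(scale=0.1, size=(d, m))
W_dec = W_enc.T.copy()                               # tied initialisation (assumption)
b_enc = np.zeros(m)
b_dec = np.zeros(d)
x = rng.normal(size=(n, d))                          # stand-in for model embeddings
z, x_hat = topk_sae_forward(x, W_enc, b_enc, W_dec, b_dec, k)
```

Each row of `z` has at most `k` non-zero entries, which is what makes the resulting dictionary sparse by construction rather than through a penalty term.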

Load-bearing premise

The clinical taxonomy of abnormality, age, sex, and medication is sufficient and unbiased for measuring how monosemantic the extracted features are.

What would settle it

A steering intervention on a pathology-labeled feature that alters age predictions in a manner inconsistent with the measured entanglement level would falsify the claim of quantifiable clinical entanglements.

If this is right

Some concepts allow selective steering without off-target effects.
Other concepts are encoded but entangled, preventing isolated intervention.
Certain interventions act as wrecking balls that collapse model performance globally.
The spectral decoder maps latent feature changes to interpretable amplitude spectrum shifts like slow-wave suppression.
Clinical entanglements such as age-pathology confounding make independent suppression impossible.
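To make the spectral readout in the list above concrete, the sketch below summarises the difference between two amplitude spectra as a mean shift per frequency band. It is an illustration only: the band edges are common EEG conventions that vary between labs, the function name `band_shift` is hypothetical, and the spectra are synthetic rather than decoder outputs from the paper.

```python
import numpy as np

# Assumed band edges in Hz; conventions differ slightly across the literature.
BANDS = {
    "delta": (0.5, 4.0),
    "theta": (4.0, 8.0),
    "alpha": (8.0, 13.0),
    "beta": (13.0, 30.0),
}

def band_shift(freqs, amp_before, amp_after, bands=BANDS):
    """Mean amplitude change (after minus before) inside each band."""
    out = {}
    for name, (lo, hi) in bands.items():
        mask = (freqs >= lo) & (freqs < hi)
        out[name] = float(np.mean(amp_after[mask] - amp_before[mask]))
    return out

freqs = np.arange(0.5, 30.0, 0.5)
before = 1.0 / freqs                                  # toy 1/f-like spectrum
after = before.copy()
after[freqs < 4.0] *= 0.5                             # mimic slow-wave suppression
after[(freqs >= 8.0) & (freqs < 13.0)] += 0.2         # mimic an alpha-band increase
shifts = band_shift(freqs, before, after)
```

In this toy case `shifts["delta"]` is negative and `shifts["alpha"]` is positive, while theta and beta are unchanged, which is the kind of band-level signature the review describes.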

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Models may require additional training to disentangle clinical variables before deployment in targeted interventions.
The framework could help identify which features are safe to manipulate in clinical settings.
Extending the spectral mapping might allow direct prediction of EEG changes from model edits.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit.

Desk Editor's Note

They apply TopK SAEs to three EEG transformers, add a target-vs-off-target steering metric, and map interventions to spectra, but the entanglement claims depend on a narrow clinical taxonomy that may miss confounders.

The paper takes TopK sparse autoencoders and runs them on the embeddings of SleepFM, REVE, and LaBraM. It grounds the resulting features in four clinical labels, measures how cleanly each feature aligns with one label versus the others, and tests steering by seeing how much changing one feature affects the target concept without hitting the others. A spectral decoder then turns those changes into frequency-band effects.

That combination of SAE application, the selectivity metric, and the spectral readout is new for EEG foundation models. The pipeline is straightforward and the regimes they describe (selectively steerable, encoded but entangled, non-encoded) give a usable way to talk about what the models actually represent. The single hyperparameter procedure that works across architectures is also a practical plus if it holds up.

The main soft spot is the clinical taxonomy itself. If age, pathology, sex, and medication are correlated with unmeasured factors such as recording site, sleep stage distribution, or comorbidities, then the reported entanglements and wrecking-ball effects could be artifacts of incomplete labeling rather than intrinsic model structure. The abstract does not show ablations on richer taxonomies or alternative health metrics, so it is hard to tell how robust the findings are.

The work is worth a serious referee if the full manuscript contains the ablation tables, error bars, and dataset details needed to check those points. People building or deploying EEG models would get value from seeing whether the steering results replicate.

Referee Report

2 major / 2 minor

Summary. The paper applies TopK Sparse Autoencoders to the embeddings of three EEG foundation models (SleepFM, REVE, LaBraM) to extract sparse feature dictionaries. These features are grounded in a four-concept clinical taxonomy (abnormality, age, sex, medication) to measure monosemanticity and entanglement. A single intrinsic dictionary-health hyperparameter procedure is used across architectures. Concept steering is performed with a new 'target vs. off-target' probe area metric that identifies three regimes (selectively steerable, encoded but entangled, non-encoded). The work reports 'wrecking-ball' interventions that collapse performance and specific clinical entanglements (e.g., age-pathology confounding), with a spectral decoder translating interventions into frequency-domain signatures.

Significance. If the empirical results and metric definitions hold under the stated taxonomy, the framework supplies a concrete, transferable auditing procedure for representational quality in clinical EEG models and directly links latent interventions to physiologically interpretable spectral changes. The architecture-agnostic hyperparameter procedure and the steering selectivity metric are potentially reusable contributions.

major comments (2)

[clinical taxonomy grounding and steering results] The central claims of 'encoded but entangled' regimes and age-pathology confounding (abstract and the steering results section) rest on the four-concept taxonomy supplying a sufficient, unbiased basis for monosemanticity measurement. The manuscript does not report controls for unmeasured confounders (recording site, sleep-stage distributions, or comorbidities) that could induce the observed steering non-selectivity; without such checks the entanglement findings risk being artifacts of taxonomy incompleteness rather than intrinsic model structure.
[hyperparameter procedure and cross-architecture results] The claim that a single intrinsic dictionary-health hyperparameter procedure 'transfers robustly across all three architectures' (abstract and methods) is load-bearing for the cross-model generality result. No ablation is shown against alternative health metrics or an expanded taxonomy, leaving open whether TopK feature stability is an artifact of the chosen audit rather than a general property.

minor comments (2)

[steering selectivity metric] The definition and computation of the 'target vs. off-target probe area metric' should be given explicitly with a formula or pseudocode, including how the area is normalized and how statistical significance is assessed.
[spectral decoder figures] Figure captions and axis labels for the spectral decoder outputs should explicitly state the frequency bands corresponding to 'pathological slow-wave suppression' and 'α-band restoration' so readers can map them to standard EEG conventions without ambiguity.
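The probe-area metric requested in the first minor comment is not defined on this page, so the following is only one plausible reading, offered as a sketch. It assumes the metric integrates how far the target probe's AUROC falls during a clamping sweep, subtracts the same integral for the off-target probe, and normalises by the sweep width. The function names and the AUROC values are hypothetical, and significance testing is not covered.

```python
def trapezoid_area(xs, ys):
    """Trapezoidal-rule integral of ys over xs."""
    return sum(
        (xs[i + 1] - xs[i]) * (ys[i + 1] + ys[i]) / 2.0
        for i in range(len(xs) - 1)
    )

def excess_selectivity(fracs, target_auroc, off_target_auroc):
    """Target degradation area minus off-target degradation area.

    fracs: fraction of ranked features clamped at each sweep step.
    Both AUROC lists start at the unclamped baseline (index 0).
    """
    t_drop = [target_auroc[0] - a for a in target_auroc]
    o_drop = [off_target_auroc[0] - a for a in off_target_auroc]
    width = fracs[-1] - fracs[0]                     # normalise by sweep width
    return (trapezoid_area(fracs, t_drop) - trapezoid_area(fracs, o_drop)) / width

fracs = [0.0, 0.25, 0.5, 0.75, 1.0]
target = [0.90, 0.70, 0.60, 0.55, 0.50]              # made-up target probe AUROCs

# Off-target probe untouched: steering looks selective.
selective = excess_selectivity(fracs, target, [0.80] * 5)
# Off-target probe degrades in lockstep: steering looks entangled.
entangled = excess_selectivity(fracs, target, [0.85, 0.65, 0.55, 0.50, 0.45])
```

Under this reading a clearly positive value corresponds to the "selectively steerable" regime and a value near zero to "encoded but entangled"; the paper's actual normalisation may differ.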

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important considerations for the robustness of our taxonomy and hyperparameter choices. We respond to each major comment below.

Referee: The central claims of 'encoded but entangled' regimes and age-pathology confounding (abstract and the steering results section) rest on the four-concept taxonomy supplying a sufficient, unbiased basis for monosemanticity measurement. The manuscript does not report controls for unmeasured confounders (recording site, sleep-stage distributions, or comorbidities) that could induce the observed steering non-selectivity; without such checks the entanglement findings risk being artifacts of taxonomy incompleteness rather than intrinsic model structure.

Authors: We agree that the four-concept taxonomy does not include explicit controls for unmeasured confounders such as recording site, sleep-stage distributions, or comorbidities, and the manuscript does not report such checks. The available dataset annotations are limited to the four concepts, so additional controls would require new data or metadata not present in the public releases. The observed entanglements are consistent across three independent models, which supports that they reflect representational properties, but we acknowledge this does not fully rule out dataset artifacts. We will add a limitations subsection discussing the taxonomy scope and the potential influence of unmeasured confounders on steering selectivity. revision: yes
Referee: The claim that a single intrinsic dictionary-health hyperparameter procedure 'transfers robustly across all three architectures' (abstract and methods) is load-bearing for the cross-model generality result. No ablation is shown against alternative health metrics or an expanded taxonomy, leaving open whether TopK feature stability is an artifact of the chosen audit rather than a general property.

Authors: The dictionary-health metric is intrinsic to the SAE optimization and does not depend on the clinical taxonomy labels, which is the basis for claiming transfer without per-model retuning. We demonstrate this empirically on three architecturally distinct models. We did not include ablations against alternative health metrics or an expanded taxonomy. We will revise the methods section to elaborate on the metric's selection rationale from prior SAE literature and to note the absence of such ablations as a limitation and direction for future work. revision: partial
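The rebuttal's point that a dictionary-health audit needs no clinical labels can be illustrated with a small sketch. The two statistics below (fraction of dead features and explained variance of the reconstruction) are common SAE diagnostics chosen here as assumptions; the page does not say which quantities the authors' audit uses, and `dictionary_health` is a hypothetical name.

```python
import numpy as np

def dictionary_health(z, x, x_hat):
    """Label-free SAE diagnostics from activations and reconstructions only."""
    fire_rate = (z != 0).mean(axis=0)                 # per-feature firing frequency
    dead_fraction = float((fire_rate == 0).mean())    # features that never fire
    resid = ((x - x_hat) ** 2).sum()
    total = ((x - x.mean(axis=0)) ** 2).sum()
    explained_variance = float(1.0 - resid / total)
    return {"dead_fraction": dead_fraction, "explained_variance": explained_variance}

rng = np.random.default_rng(0)
x = rng.normal(size=(100, 8))                         # stand-in embeddings
z = np.zeros((100, 4))                                # stand-in SAE activations
z[:, 0] = 1.0                                         # feature 0 always fires
z[::2, 1] = 1.0                                       # feature 1 fires on half the samples
# Features 2 and 3 never fire, so half the dictionary is dead.
health = dictionary_health(z, x, x_hat=x.copy())      # perfect reconstruction, for illustration
```

Nothing in the inputs refers to abnormality, age, sex, or medication, which is the sense in which such an audit is independent of the taxonomy.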

Circularity Check

0 steps flagged

Empirical pipeline with independent definitions; no circular reductions

The paper applies TopK SAEs to extract features from EEG model embeddings, grounds them in an external clinical taxonomy (abnormality, age, sex, medication), defines a dictionary-health hyperparameter procedure, and introduces a target vs. off-target probe area metric to identify steering regimes. These steps are operational and benchmarked across architectures without any quoted equation or procedure reducing a reported quantity to a fitted input or self-citation by construction. The central claims rest on empirical measurements rather than definitional equivalence.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract supplies no explicit free parameters, axioms, or invented entities; all claims rest on the unstated assumption that the SAE dictionary and clinical taxonomy are appropriate.

pith-pipeline@v0.9.0 · 5825 in / 1115 out tokens · 29111 ms · 2026-05-25T06:03:07.221486+00:00 · methodology

Original abstract

EEG foundation models achieve state-of-the-art clinical performance, yet the internal computations driving their predictions remain opaque: a barrier to clinical trust. We apply TopK Sparse Autoencoders (SAEs) across three architecturally distinct EEG transformers: SleepFM, REVE, and LaBraM to extract sparse feature dictionaries from their embeddings. By grounding these features in a clinical taxonomy (abnormality, age, sex, and medication), we benchmark monosemanticity and entanglement across architectures. A single hyperparameter procedure, driven by an intrinsic dictionary health audit, transfers robustly across all three architectures. Via concept steering, we introduce a "target vs. off-target" probe area metric to quantify steering selectivity and reveal three operational regimes: selectively steerable, encoded but entangled, and non-encoded. This framework exposes critical representational failures: "wrecking-ball" interventions that collapse global model performance, and clinical entanglements, such as age-pathology confounding, where it is impossible to suppress one concept without corrupting the other. Finally, a spectral decoder maps these interventions back to the amplitude spectrum, translating latent manipulations into physiologically interpretable frequency signatures, such as pathological slow-wave suppression and $\alpha$-band restoration.

Figures

Figures reproduced from arXiv: 2605.13930 by Andreas Brink-Kjær, Anton Mosquera Storgaard, James Zou, Lars Kai Hansen, Magnus Guldberg Pedersen, Magnus Ruud Kjær, Nick Williams, Radu Gatej, Rahul Thapa, Sadasivan Puthusserypady, Sándor Beniczky, Tue Lehn-Schiøler, William Lehn-Schiøler.

**Figure 1.** Pipeline overview. Starting from a frozen EEG foundation model: (Stage I) A shallow MLP spectral decoder translates token embeddings back into a human-interpretable space. (Stage II) For each transformer layer, a TopK SAE recovers a sparse, over-complete feature dictionary from normalized encoder activations. (Stage III) SAE features are mapped to known clinical concepts using TCAV. (Stage IV) Concept stee…

**Figure 2.** SAE-faithfulness layer sweep. Test AUROC of a linear probe trained via 5-fold cross-validation on mean-pooled embeddings of each finetuned encoder. During inference, layer-ℓ activations are replaced by their TopK-SAE reconstructions as ℓ sweeps through every transformer block. Shaded bands represent 95% confidence intervals across the CV folds; the dotted horizontal lines indicate the no-SAE baseline mean…

**Figure 3.** Monosemanticity taxonomy across SAE expansion and encoder depth. Each cell reports the fraction of concept-enriched SAE features in one of three taxonomy classes (Separable: monosemantic; Entangled: polysemantic co-activations; Dead: semantically uninformative/inactive). Columns represent encoders (SleepFM, LaBraM, REVE), with x-axes indexing the encoder layer and y-axes indexing expansion factor E ∈ {1, 2…

**Figure 4.** Concept encoding strength and steering selectivity. Top: Encoding strength (AUROC$_0$) measured via per-layer linear probes fit to the clean SAE-decoded reconstructions. Bottom: Excess selectivity ($\tilde{\Delta}$), quantifying the integrated asymmetry between target and off-target probe degradation under TCAV-ranked clamping (Section 3.5). For abnormality as target, we use age as off-target. For all other targets, we us…

**Figure 5.** Steering sweeps across the encoding–selectivity landscape. Nine representative configurations from …

**Figure 6.** … grounds the abstract selectivity metrics (…

**Figure 7.** SAE dictionary size across encoders and expansion rates. Each cell gives the number of learned SAE features, which equals the encoder's embedding dimension ($d_{enc}$ = 128 for SleepFM, 200 for LaBraM, 512 for REVE) times the expansion rate E. Because REVE is 4× wider than SleepFM, an E=1 REVE SAE already matches the size of an E=4 SleepFM SAE (512 features each), and the E=64 REVE configuration spans 32,768 features. This asymmet…
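The arithmetic behind Figure 7 is easy to check. The sketch below uses the embedding widths quoted in the caption. The expansion grid is an assumption, since the caption only names E=1, E=4 and E=64, and `dictionary_size` is a hypothetical helper.

```python
# Embedding widths quoted in the Figure 7 caption.
D_ENC = {"SleepFM": 128, "LaBraM": 200, "REVE": 512}
# Assumed expansion grid (powers of two up to 64).
EXPANSIONS = [1, 2, 4, 8, 16, 32, 64]

def dictionary_size(d_enc, expansion):
    """Number of SAE features: embedding width times expansion rate."""
    return d_enc * expansion

sizes = {
    name: {e: dictionary_size(d, e) for e in EXPANSIONS}
    for name, d in D_ENC.items()
}
for name, row in sizes.items():
    print(name, row)
```

With these widths an E=1 REVE dictionary and an E=4 SleepFM dictionary both hold 512 features, and the E=64 REVE dictionary holds 32,768, so comparisons at equal E are not comparisons at equal dictionary size.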

Review history (3 revisions)

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

ECG-InterpBench: Benchmarking the Interpretability of ECG Foundation Models with Matched-Scale Sparse Autoencoders
cs.LG · 2026-07 · conditional novelty 6.0

With matched-scale sparse autoencoders, HuBERT-ECG best preserves its ECG representation while ECG-JEPA best exposes clinical measurements through single features, a leader split that repeats on MIMIC-IV-ECG.

Foundation Models for EEG Are Blind to Long-Range Temporal Correlations: A Spectral-Temporal Dissociation Behind Their Cross-Population Fragility
q-bio.NC · 2026-07 · conditional novelty 6.0

EEG foundation models fail to encode the alpha-envelope DFA exponent, a disease-relevant temporal-scaling feature, while spectral-input models still encode the static 1/f slope.

Reference graph

Works this paper leans on

31 extracted references · 31 canonical work pages · cited by 2 Pith papers · 8 internal anchors

[1] Rahul Thapa, Magnus Ruud Kjaer, Bryan He, Ian Covert, Hyatt Moore IV, Umaer Hanif, Gauri Ganjoo, M. Brandon Westover, Poul Jennum, Andreas Brink-Kjaer, Emmanuel Mignot, and James Zou. A multimodal sleep foundation model for disease prediction. Nature Medicine, 32:752–762, 2026. doi: 10.1038/s41591-025-04133-4

[2] Yassine El Ouahidi, Jonathan Lys, Philipp Thölke, Nicolas Farrugia, Bastien Pasdeloup, Vincent Gripon, Karim Jerbi, and Giulia Lioi. REVE: A foundation model for EEG: Adapting to any setup with large-scale pretraining on 25,000 subjects. Advances in Neural Information Processing Systems, 2025. URL https://brain-bzh.github.io/reve/

[3] Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous EEG data in BCI. In International Conference on Learning Representations (ICLR), 2024.

[4] Demetres Kostas, Stéphane Aroca-Ouellette, and Frank Rudzicz. BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience, 15, 2021.

[5] William Lehn-Schiøler, Magnus Ruud Kjær, Phillip Hempel, Magnus Guldberg Pedersen, Rahul Thapa, Bryan He, Nicolai Spicher, Andreas Brink-Kjaer, Lars Kai Hansen, and Emmanuel Mignot. Pretraining on sleep data improves non-sleep biosignal tasks. arXiv preprint arXiv:2605.02500, 2026.

[6] Sándor Beniczky, Harald Aurlien, Jan C Brøgger, Lawrence J Hirsch, Donald L Schomer, Eugen Trinka, et al. Standardized computer-based organized reporting of EEG: SCORE – second version. Clinical Neurophysiology, 128(11):2334–2346, 2017. doi: 10.1016/j.clinph.2017.07.418

[7–8] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021. URL https://transformer-circuits.pub/2021/framework/index.html

[9] Nelson Elhage et al. Toy models of superposition. Transformer Circuits Thread, 2022.

[10] Trenton Bricken et al. Towards monosemanticity: Decomposing language models with dictionary learning. Transformer Circuits Thread, 2023.

[11] Adly Templeton et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/

[12] Tom Lieberum et al. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147, 2024.

[13] Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, and Marco Grangetto. MedSAE: Dissecting MedCLIP representations with sparse autoencoders. arXiv preprint arXiv:2510.26411, 2025.

[14] Krishna Kanth Nakka. Mammo-SAE: Interpreting breast cancer concept learning with sparse autoencoders, 2025.

[15] Elana Simon and James Zou. InterPLM: Discovering interpretable features in protein language models via sparse autoencoders. Nature Methods, 22(10):2107–2117, 2025.

[16] Laurence Freeman, Philip Shamash, Vinam Arora, Caswell Barry, Tiago Branco, and Eva Dyer. Beyond black boxes: Enhancing interpretability of transformers trained on neural data, 2025.

[17–18] Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, and Niki van Stein. Mechanistic interpretability for transformer-based time series classification. In Computational Intelligence. IJCCI 2025, volume 2829 of Communications in Computer and Information Science. Springer, doi: 10.1007/978-3-032-15638-9_15

[19] Alireza Makhzani and Brendan Frey. k-sparse autoencoders. In International Conference on Learning Representations (ICLR), 2014.

[20] Hoagy Cunningham et al. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[23] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations (ICLR), 2022.

[24] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, Jesse Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond classification accuracy: Quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[25] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2017.

[26] Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022. doi: 10.1162/coli_a_00422

[27] Anders Gjølbye Madsen, William Theodor Lehn-Schiøler, Áshildur Jónsdóttir, Bergdís Arnardóttir, and Lars Kai Hansen. Concept-based explainability for an EEG transformer model. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, September 2023. doi: 10.1109/mlsp55844.2023.10285992. URL http://dx.doi.o...

[28] Nomin Enkhtsetseg, William Lehn-Schiøler, Anton Storgaard Mosquera, Magnus Guldberg Pedersen, Dylan Rice, George Wambugu, Nshimiyimana Jules Fidele, Melita Cacic Hribljan, Anca Alina Arbune, Sidsel Armand Larsen, Sandor Beniczky, and Farrah J. Mateen. Clinical utility and feasibility of smartphone-based EEG in Kenya: A multicenter observational study. a...

[29] Anders Gjølbye, Lina Skerath, William Lehn-Schiøler, Nicolas Langer, and Lars Kai Hansen. SPEED: Scalable preprocessing of EEG data for self-supervised learning. In Proceedings of the 2024 IEEE International Workshop on Machine Learning for Signal Processing, 2024.

[30] Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[31] Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, 2021.

[1] [1]

doi: 10.1038/s41591-025-04133-4

Rahul Thapa, Magnus Ruud Kjaer, Bryan He, Ian Covert, Hyatt Moore IV , Umaer Hanif, Gauri Ganjoo, M. Brandon Westover, Poul Jennum, Andreas Brink-Kjaer, Emmanuel Mignot, and James Zou. A multimodal sleep foundation model for disease prediction.Nature Medicine, 32: 752–762, 2026. doi: 10.1038/s41591-025-04133-4

work page doi:10.1038/s41591-025-04133-4 2026

[2] [2]

REVE: A foundation model for EEG: Adapting to any setup with large-scale pretraining on 25,000 subjects.Advances in Neural Information Processing Systems, 2025

Yassine El Ouahidi, Jonathan Lys, Philipp Thölke, Nicolas Farrugia, Bastien Pasdeloup, Vincent Gripon, Karim Jerbi, and Giulia Lioi. REVE: A foundation model for EEG: Adapting to any setup with large-scale pretraining on 25,000 subjects.Advances in Neural Information Processing Systems, 2025. URLhttps://brain-bzh.github.io/reve/

work page 2025

[3] [3]

Large brain model for learning generic representations with tremendous EEG data in BCI

Wei-Bang Jiang, Li-Ming Zhao, and Bao-Liang Lu. Large brain model for learning generic representations with tremendous EEG data in BCI. InInternational Conference on Learning Representations (ICLR), 2024

work page 2024

[4] [4]

BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data

Demetres Kostas, Stéphane Aroca-Ouellette, and Frank Rudzicz. BENDR: Using transformers and a contrastive self-supervised learning task to learn from massive amounts of EEG data. Frontiers in Human Neuroscience, 15, 2021

work page 2021

[5] [5]

Pretraining on Sleep Data Improves non-Sleep Biosignal Tasks

William Lehn-Schiøler, Magnus Ruud Kjær, Phillip Hempel, Magnus Guldberg Pedersen, Rahul Thapa, Bryan He, Nicolai Spicher, Andreas Brink-Kjaer, Lars Kai Hansen, and Em- manuel Mignot. Pretraining on sleep data improves non-sleep biosignal tasks.arXiv preprint arXiv:2605.02500, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[6] [6]

Standardized computer-based organized reporting of EEG: SCORE – second version.Clinical Neurophysiology, 128(11):2334–2346, 2017

Sándor Beniczky, Harald Aurlien, Jan C Brøgger, Lawrence J Hirsch, Donald L Schomer, Eugen Trinka, et al. Standardized computer-based organized reporting of EEG: SCORE – second version.Clinical Neurophysiology, 128(11):2334–2346, 2017. doi: 10.1016/j.clinph.2017.07. 418

work page doi:10.1016/j.clinph.2017.07 2017

[7] [7]

A mathematical framework for transformer circuits.Transformer Circuits Thread,

Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, Nova DasSarma, Dawn Drain, Deep Ganguli, Zac Hatfield-Dodds, Danny Hernandez, Andy Jones, Jackson Kernion, Liane Lovitt, Kamal Ndousse, Dario Amodei, Tom Brown, Jack Clark, Jared Kaplan, Sam McCandlish, and Chris Olah. A...

work page

[8] [8]

URLhttps://transformer-circuits.pub/2021/framework/index.html

work page 2021

[9] [9]

Toy models of superposition.Transformer Circuits Thread, 2022

Nelson Elhage et al. Toy models of superposition.Transformer Circuits Thread, 2022

work page 2022

[10] [10]

Towards monosemanticity: Decomposing language models with dictio- nary learning.Transformer Circuits Thread, 2023

Trenton Bricken et al. Towards monosemanticity: Decomposing language models with dictio- nary learning.Transformer Circuits Thread, 2023

work page 2023

[11] Adly Templeton et al. Scaling monosemanticity: Extracting interpretable features from Claude 3 Sonnet. Transformer Circuits Thread, 2024. URL https://transformer-circuits.pub/2024/scaling-monosemanticity/

[12] Tom Lieberum et al. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147, 2024.

[13] Riccardo Renzulli, Colas Lepoutre, Enrico Cassano, and Marco Grangetto. MedSAE: Dissecting MedCLIP representations with sparse autoencoders. arXiv preprint arXiv:2510.26411, 2025.

[14] Krishna Kanth Nakka. Mammo-SAE: Interpreting breast cancer concept learning with sparse autoencoders, 2025.

[15] Elana Simon and James Zou. InterPLM: discovering interpretable features in protein language models via sparse autoencoders. Nature Methods, 22(10):2107–2117, 2025.

[16] Laurence Freeman, Philip Shamash, Vinam Arora, Caswell Barry, Tiago Branco, and Eva Dyer. Beyond black boxes: Enhancing interpretability of transformers trained on neural data, 2025.

[17] Matīss Kalnāre, Sofoklis Kitharidis, Thomas Bäck, and Niki van Stein. Mechanistic interpretability for transformer-based time series classification. In Computational Intelligence. IJCCI 2025, volume 2829 of Communications in Computer and Information Science. Springer. doi: 10.1007/978-3-032-15638-9_15

[19] Alireza Makhzani and Brendan Frey. k-sparse autoencoders. In International Conference on Learning Representations (ICLR), 2014.

[20] Hoagy Cunningham et al. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023.

[21] Aaron van den Oord, Yazhe Li, and Oriol Vinyals. Representation learning with contrastive predictive coding. arXiv preprint arXiv:1807.03748, 2018.

[22] Kaiming He, Xinlei Chen, Saining Xie, Yanghao Li, Piotr Dollár, and Ross Girshick. Masked autoencoders are scalable vision learners. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2022.

[23] Hangbo Bao, Li Dong, Songhao Piao, and Furu Wei. BEiT: BERT pre-training of image transformers. In International Conference on Learning Representations (ICLR), 2022.

[24] Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, Jesse Wexler, Fernanda Viegas, and Rory Sayres. Interpretability beyond classification accuracy: Quantitative testing with concept activation vectors (TCAV). In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[25] Guillaume Alain and Yoshua Bengio. Understanding intermediate layers using linear classifier probes. arXiv preprint arXiv:1610.01644, 2017.

[26] Yonatan Belinkov. Probing classifiers: Promises, shortcomings, and advances. Computational Linguistics, 48(1):207–219, 2022. doi: 10.1162/coli_a_00422

[27] Anders Gjølbye Madsen, William Theodor Lehn-Schiøler, Áshildur Jónsdóttir, Bergdís Arnardóttir, and Lars Kai Hansen. Concept-based explainability for an EEG transformer model. In 2023 IEEE 33rd International Workshop on Machine Learning for Signal Processing (MLSP), pages 1–6. IEEE, September 2023. doi: 10.1109/mlsp55844.2023.10285992. URL http://dx.doi.o...

[28] Nomin Enkhtsetseg, William Lehn-Schiøler, Anton Storgaard Mosquera, Magnus Guldberg Pedersen, Dylan Rice, George Wambugu, Nshimiyimana Jules Fidele, Melita Cacic Hribljan, Anca Alina Arbune, Sidsel Armand Larsen, Sandor Beniczky, and Farrah J. Mateen. Clinical utility and feasibility of smartphone-based EEG in Kenya: A multicenter observational study. a...

[29] Anders Gjølbye, Lina Skerath, William Lehn-Schiøler, Nicolas Langer, and Lars Kai Hansen. SPEED: Scalable preprocessing of EEG data for self-supervised learning. In Proceedings of the 2024 IEEE International Workshop on Machine Learning for Signal Processing, 2024.

[30] Nora Belrose, David Schneider-Joseph, Shauli Ravfogel, Ryan Cotterell, Edward Raff, and Stella Biderman. LEACE: Perfect linear concept erasure in closed form. In Advances in Neural Information Processing Systems (NeurIPS), 2023.

[31] Yanai Elazar, Shauli Ravfogel, Alon Jacovi, and Yoav Goldberg. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9:160–175, 2021.

A Technical appendices and supplementary material

Table 3: Notation reference.

Symbol  Type / shape  Definition  First used
Encoder
d  scalar ∈ Z+  E...