
arxiv: 2605.21849 · v1 · pith:U5LFUISKnew · submitted 2026-05-21 · 💻 cs.LG · cs.CL

Geometry-Adaptive Explainer for Faithful Dictionary-Based Interpretability under Distribution Shift

Pith reviewed 2026-05-22 08:11 UTC · model grok-4.3

keywords: mechanistic interpretability, dictionary learning, distribution shift, sparse autoencoders, out-of-distribution, causal faithfulness, geometry adaptive explainer

The pith

Realigning an ID-trained dictionary to the model's OOD-active subspace restores faithfulness without retraining or labels.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Distribution shift rotates the subspace a model actively uses. Dictionary-based explainers trained only on in-distribution (ID) activations become misaligned with that subspace, and their faithfulness gap grows. The paper defines this gap geometrically and introduces the Geometry-Adaptive Explainer (GAE), which realigns the existing dictionary with the out-of-distribution (OOD) active subspace while preserving the original feature structure. The method needs only unlabeled OOD activations and takes no gradient steps. A proof shows that the excess loss is bounded quadratically by the second-moment shift between the distributions. Experiments across models and shift types show that the adapted explainer matches or exceeds training-based baselines on causal faithfulness metrics.

Core claim

The Geometry-Adaptive Explainer (GAE) realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. GAE reduces the faithfulness gap, with excess loss bounded quadratically by the second-moment shift, and empirically matches or surpasses training-based baselines in causal faithfulness across multiple models and OOD settings.

What carries the argument

Geometry-Adaptive Explainer (GAE), which rotates the ID dictionary onto the OOD-active subspace to close the geometric faithfulness gap while keeping feature structure fixed.
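The realignment described above can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names (`top_subspace`, `realign_decoder`), the use of the top-r eigenvectors of the empirical second-moment matrix as the OOD-active subspace, and the choice to re-express only the in-subspace component of each atom are all assumptions made for the example. It covers only an orthogonal-Procrustes subspace alignment and omits any per-feature refitting.

```python
import numpy as np

def top_subspace(acts: np.ndarray, r: int) -> np.ndarray:
    """Orthonormal basis (d, r) spanning the top-r eigenvectors of the
    empirical second-moment matrix of activations with shape (n, d)."""
    M = acts.T @ acts / acts.shape[0]
    _, eigvecs = np.linalg.eigh(M)          # eigenvalues in ascending order
    return eigvecs[:, ::-1][:, :r]          # keep the r largest

def realign_decoder(W_dec: np.ndarray, ood_acts: np.ndarray, r: int) -> np.ndarray:
    """Hypothetical helper. W_dec has shape (d, k); columns are atoms."""
    # Top-r subspace of the existing dictionary (assumed stand-in for Pi_dec).
    U_dec = np.linalg.svd(W_dec, full_matrices=False)[0][:, :r]
    U_ood = top_subspace(ood_acts, r)
    # Orthogonal Procrustes: Q = argmin ||U_dec Q - U_ood||_F over orthogonal Q.
    A, _, Bt = np.linalg.svd(U_dec.T @ U_ood)
    Q = A @ Bt
    coords = U_dec.T @ W_dec                # (r, k) atom coordinates in the old subspace
    residual = W_dec - U_dec @ coords       # part of each atom outside that subspace
    # U_ood @ Q.T is the orthonormal basis of the OOD subspace closest to U_dec.
    return U_ood @ Q.T @ coords + residual
```

Because `U_ood @ Q.T` has orthonormal columns, the inner products among the in-subspace components of the atoms are unchanged by the move, which is one simple reading of "keeping feature structure fixed"; the paper may define structure preservation differently.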

If this is right

  • The faithfulness gap equals the geometric distance between the ID dictionary and the OOD-active subspace and directly controls OOD degradation.
  • GAE improves over the unadapted ID explainer with excess loss bounded quadratically by the second-moment shift.
  • GAE matches or exceeds the causal faithfulness of all training-based baselines while using only unlabeled OOD activations.
  • Realignment can be performed without changing the semantic meaning of the learned features.
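To make "geometric distance between the ID dictionary and the OOD-active subspace" concrete, the sketch below computes three standard subspace comparisons: principal angles, a normalized overlap, and the Frobenius distance between orthogonal projectors. The function names are illustrative, and using the projector distance as a stand-in for the paper's gap ∆ is an assumption; the page does not give the exact definition.

```python
import numpy as np

def principal_angles(U: np.ndarray, V: np.ndarray) -> np.ndarray:
    """Principal angles (radians) between the column spans of two
    orthonormal bases U and V, each of shape (d, r)."""
    s = np.linalg.svd(U.T @ V, compute_uv=False)
    return np.arccos(np.clip(s, -1.0, 1.0))

def subspace_overlap(U: np.ndarray, V: np.ndarray) -> float:
    """Normalized overlap ||U^T V||_F^2 / r, between 0 and 1 (assumed definition)."""
    return float(np.sum((U.T @ V) ** 2) / U.shape[1])

def projector_gap(U: np.ndarray, V: np.ndarray) -> float:
    """Frobenius distance between the orthogonal projectors onto span(U) and span(V)."""
    return float(np.linalg.norm(U @ U.T - V @ V.T, ord="fro"))
```

For equal-dimensional subspaces the projector distance equals sqrt(2 · Σ sin²θ_i) over the principal angles θ_i, so a gap of zero means the two subspaces coincide and larger angles mean a larger gap.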

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same subspace-realignment step could be applied to other dictionary-style interpretability tools when data distributions drift.
  • GAE may lower the cost of maintaining explanations in deployed systems that encounter gradual distribution shift.
  • Testing the quadratic bound on larger or adversarial shifts would clarify the practical range of the guarantee.

Load-bearing premise

Realigning the dictionary to the OOD-active subspace can be done while preserving the original feature structure and suffices to control faithfulness without gradient-based optimization or labeled data.

What would settle it

Measure causal faithfulness on held-out OOD activations for the unadapted ID dictionary, the GAE-adjusted dictionary, and a fully retrained dictionary; if GAE fails to reduce the gap relative to the unadapted version or to match the retrained version, the central claim does not hold.
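The three-way comparison above reduces to a small decision rule. The sketch below is a hypothetical harness: `evaluate`, `settle`, and the dictionary keys `fixed`, `gae`, and `retrained` are names invented for the example, and `explained_energy` is only a placeholder metric for linear subspaces, not one of the paper's causal faithfulness metrics. A real test would plug in an intervention-based score.

```python
from typing import Callable, Dict
import numpy as np

def explained_energy(basis: np.ndarray, acts: np.ndarray) -> float:
    """Placeholder metric: fraction of activation energy captured by an
    orthonormal basis (d, r). Higher is better."""
    return float(np.sum((acts @ basis) ** 2) / np.sum(acts ** 2))

def evaluate(explainers: Dict[str, np.ndarray],
             metric: Callable[[np.ndarray, np.ndarray], float],
             held_out: np.ndarray) -> Dict[str, float]:
    """Score every explainer on the same held-out OOD activations."""
    return {name: float(metric(e, held_out)) for name, e in explainers.items()}

def settle(scores: Dict[str, float], tol: float = 0.0) -> Dict[str, bool]:
    """Both checks must pass for the central claim to survive."""
    return {
        "improves_over_fixed": scores["gae"] > scores["fixed"],
        "matches_retrained": scores["gae"] >= scores["retrained"] - tol,
    }
```

The tolerance `tol` is an assumed knob for deciding what counts as "matching" the retrained dictionary; the source does not specify one.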

Figures

Figures reproduced from arXiv: 2605.21849 by Andrew Lee, Heedong Kim, Kyungwoo Song, Sungjun Lim.

Figure 1: Faithfulness gap and GAE. Left: distribution shift (illustrated as a language change) rotates the OOD-active subspace ΠOOD away from the ID-trained explainer subspace Πdec ≈ ΠID, opening a faithfulness gap ∆(Πdec). Right: GAE closes this gap in two steps. Step 1 rotates Πdec onto ΠOOD via orthogonal Procrustes. Step 2 refits individual feature directions within the aligned subspace to match OOD activations…
Figure 2: Controlled experiment on a toy MLP with OOD severity varied from 0 (ID) to 1 (maximum shift). (a) The Fixed explainer’s faithfulness gap ∆(Πdec) grows monotonically. (b) Its reconstruction error rises accordingly. GAE maintains near-zero gap and flat error throughout. We first test whether the geometric mechanism from Section 3 holds in a controlled setting. We train a 2-layer ReLU MLP with hidden dim d=…
Figure 3: Per-feature DLA on a prompt predicting ‘ American’ (GPT-2, Transcoder). Both methods share the same encoder and top-3 features; only the decoder columns differ. Each cell shows a feature’s direct logit attribution (DLA) to nationality tokens (left, 20 tokens) vs. non-nationality controls (right, 10 tokens). Fixed’s total class-specificity is −0.55 (circuit points away from the target class); GAE’s is +1.39…
Figure 4: Mechanism analysis (GPT-2, Transcoder). (a) Sorted principal angles between each explainer’s top-r subspace and Π̂OOD. GAE’s subspace aligns with Π̂OOD, while Fixed and Finetune leave large angular gaps. (b) Step ablation: Step 1 closes the faithfulness gap to 0 yet drops nComp from 0.74 to 0.44. Step 2 restores nComp to 0.96 at the cost of a small gap (1.59). Subspace alignment…
Figure 5: Explainer subspace overlap between the explainer and the ID-active subspace (solid) vs. the OOD-active subspace (dashed), as a function of OOD severity s, for dictionary sizes k ∈ {d/2, 1d, 2d, 4d, 8d, 32d}. Left: Transcoder. Right: SAE. For k ≥ 4d, both explainer types maintain high ID overlap (> 0.89) regardless of severity, while OOD overlap degrades monotonically. Transcoder (left). The explainer–ID ov…
Figure 6: Subspace overlap for ID-trained explainers on GPT-2 Small and Pythia-1.4B under temporal, domain, and adversarial shifts. The rank is r=64 for both models. [Bar chart, panel (a) Transcoder. x-axis: model (GPT-2, Pythia-1.4B); y-axis: Subspace Overlap (0 to 1); series: vs ID, vs OOD (Temporal), vs OOD (Domain), vs OOD (Adversarial). Bar labels as extracted: 0.73, 0.70, 0.25, 0.22, 0.53, 0.41, 0.10, 0.14. A second panel follows, with bar labels 0.26, 0.…]
Figure 7: Explainer-dependent ratio η as a function of OOD severity s for dictionary sizes k ∈ {d/2, 1d, 2d, 4d, 8d, 32d}. At s=1.0, η ≈ 0.31 for both explainer types, independent of k. Transcoder (left). At pure OOD, domain and adversarial shifts reach η > 0.99 across both models: the explainer-dependent component dominates the total error almost entirely. Temporal shift yields η ≈ 0.66–0.99 depending on the model,…
Figure 8: Explainer-dependent ratio η at pure ID (hatched) and pure OOD (colored) (r=64). Under domain and adversarial shifts, η > 0.99 for both explainer types. B.4 Empirical Verification of Proposition 1. Proposition 1 predicts that second-moment shift controls the faithfulness gap via ∆(ΠID) ≤ (√2 / γID) ∥MOOD − MID∥F. Since this bound depends only on MID and MOOD, it is independent of the explainer architecture and…
Figure 9: Normalized second-moment shift plotted against ∆(ΠID), with color indicating OOD severity s. The two quantities are near-perfectly correlated (Pearson r=0.993, Spearman ρ=1.000), consistent with the linear upper bound. [Scatter plot. x-axis: Second-Moment Shift (normalized); y-axis: Faithfulness Gap (ID); color bar: OOD Severity.]
Figure 10: Proposition 1 verification (r=64) at pure OOD. Top row: Transcoder. Bottom row: SAE. Within each model, larger shifts correspond to larger gaps for both explainer types.
Figure 11: Empirical verification of Theorem 1 on the controlled toy setting. Projection-loss improvement I(s) versus the squared faithfulness gap ∆(ΠID)², swept across OOD severity s ∈ [0, 1]. The dashed line is a linear fit (R² = 0.93, Pearson r = 0.96), supporting the quadratic dependence predicted by Theorem 1.
Figure 12: Per-feature DLA on a prompt predicting ‘ Henry’ (GPT-2, Transcoder). The truncated input is “Question: What nationality was James”; the next token is a male first name. Each cell reports the feature’s direct logit attribution to a candidate token, with 20 male first names on the left and 10 unrelated noun controls on the right. Fixed’s total class-specificity is +1.00 and GAE’s is +4.51, a 4.5× amplificat…
Figure 13: Per-feature DLA on a prompt predicting ‘ politician’ (GPT-2, Transcoder). The truncated input is “Question: Which American”; the next token is a profession. The 20 class-member tokens are common professions and the 10 controls are unrelated nouns. Fixed’s total class-specificity is +0.54 and GAE’s is +0.99. The GAE row drives several control cells negative (blue), where Fixed leaves them positive, sharpen…
Figure 14: Hyperparameter sweeps on HaluEval (GPT-2, Transcoder): nComp (orange, left axis) and |∆CE| (cyan, right axis) are stable across rank r, OOD sample size NOOD, and preservation weight λpres. Rank r: nComp stays above 0.95 for every r ∈ {1, …, 64}; rank-1 already gives 0.951, confirming that the ID-to-OOD drift concentrates in a few directions. OOD sample size NOOD: |∆CE| improves from 0.038 at N = 500 t…
Figure 15: Faithfulness of training-free explainer methods on held-out in-distribution data with the Transcoder explainer. The left axis plots the causal-faithfulness metrics nAOPC and nComp (higher is better) and the right axis plots reconstruction quality |∆CE| (lower is better). Both backbones show the same pattern: GAE lifts nAOPC and nComp above Fixed and TERM, with the largest swing on GPT-2 (nComp +0.10), whi…
Figure 16: Faithfulness of training-free explainer methods on held-out in-distribution data with the Top-K SAE explainer. The left axis plots nAOPC and nComp (higher is better) and the right axis plots |∆CE| (lower is better). The SAE dictionary already sits closer to optimal on this ID slice, so the gap to Fixed is narrower than on the Transcoder cells, but GAE still moves both causal metrics in the right direction…
Original abstract

Mechanistic interpretability aims to explain a model's behavior by identifying causally responsible internal structures. Dictionary-based explainers such as sparse autoencoders and transcoders are a primary tool, but their faithfulness under out-of-distribution (OOD) shift has received little systematic attention. We show that distribution shift rotates the subspace that the model actively uses, misaligning the explainer's dictionary trained on in-distribution (ID) activations. We formalize this misalignment as the faithfulness gap, a geometric distance between the ID dictionary and the OOD-active subspace, and show that it controls OOD faithfulness degradation. To reduce this gap, we propose the Geometry-Adaptive Explainer (GAE), which realigns the explainer's dictionary with the OOD-active subspace while preserving the original feature structure. This requires only unlabeled OOD activations and no gradient updates. We prove that GAE improves over the unadapted ID explainer, with excess loss bounded quadratically by the second-moment shift. Empirically, GAE even matches or surpasses all training-based baselines in causal faithfulness across multiple models and OOD settings.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that distribution shift rotates the active subspace used by a model, misaligning ID-trained dictionary explainers (e.g., sparse autoencoders) and degrading OOD faithfulness. It formalizes this as a 'faithfulness gap' equal to the geometric distance between the ID dictionary and OOD-active subspace, proposes the Geometry-Adaptive Explainer (GAE) that realigns the dictionary to the OOD subspace via a structure-preserving map using only unlabeled OOD activations, proves that the excess loss of GAE over the unadapted explainer is bounded quadratically by the second-moment shift, and reports that GAE matches or exceeds training-based baselines in causal faithfulness across models and OOD settings.

Significance. If the central proof and the invariance of causal feature interpretations under realignment hold, the work supplies a lightweight, training-free adaptation method for dictionary-based interpretability that directly ties geometric misalignment to faithfulness degradation. This could strengthen reliability of mechanistic explanations under shift without requiring labeled OOD data or gradient updates, and the quadratic bound offers a concrete, testable link between distribution shift statistics and explanation quality.

major comments (3)
  1. [§3.2] §3.2 (Realignment Operator): The claim that the realignment 'preserves the original feature structure' is stated as a property of the chosen linear map or projection, but the manuscript does not derive that this operator commutes with the sparsity selection or causal intervention used to measure faithfulness. Without this, geometric gap reduction does not necessarily imply improved causal faithfulness, as atom mixing could alter individual feature semantics while reducing the reported distance.
  2. [§4] §4 (Proof of Quadratic Excess-Loss Bound): The bound is expressed in terms of the second-moment shift, which is treated as an external quantity. It is unclear from the derivation whether the bound remains valid when the realignment operator is itself estimated from the same OOD activations that define the shift; a self-referential dependence would require an additional contraction or fixed-point argument that is not supplied.
  3. [Table 2, §5.3] Table 2 and §5.3 (Empirical Faithfulness): The causal faithfulness metric relies on intervention-based evaluation, yet the paper does not report whether the same intervention sets are used for both ID and OOD regimes or whether the realignment affects the support of the selected features. If the support changes, the cross-regime comparison may confound geometric improvement with changes in the underlying causal variables.
minor comments (2)
  1. [§2] Notation for the OOD-active subspace is introduced in §2 but reused without redefinition in the proof; a single forward reference or appendix glossary would improve readability.
  2. [Figure 3] Figure 3 caption does not specify the exact number of OOD samples used to estimate the active subspace; this detail is needed to assess sensitivity of the reported gains.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive comments. We address each major point below and outline the revisions we will make to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (Realignment Operator): The claim that the realignment 'preserves the original feature structure' is stated as a property of the chosen linear map or projection, but the manuscript does not derive that this operator commutes with the sparsity selection or causal intervention used to measure faithfulness. Without this, geometric gap reduction does not necessarily imply improved causal faithfulness, as atom mixing could alter individual feature semantics while reducing the reported distance.

    Authors: We agree that an explicit derivation of commutation would make the connection between geometric realignment and causal faithfulness more rigorous. The realignment operator is an orthogonal map onto the OOD-active subspace chosen to preserve inner products among dictionary atoms. In the revised manuscript we will add a short lemma in §3.2 establishing that, under the standard incoherence assumption used for dictionary learning, this operator commutes with the sparsity selection step. Consequently, the support and semantics of individual atoms remain unchanged for the purpose of causal interventions, so that reduction of the geometric gap directly improves the measured faithfulness. revision: yes

  2. Referee: [§4] §4 (Proof of Quadratic Excess-Loss Bound): The bound is expressed in terms of the second-moment shift, which is treated as an external quantity. It is unclear from the derivation whether the bound remains valid when the realignment operator is itself estimated from the same OOD activations that define the shift; a self-referential dependence would require an additional contraction or fixed-point argument that is not supplied.

    Authors: The referee correctly notes that the current proof treats the realignment operator as given with respect to population quantities. When the operator is estimated from the same finite OOD sample that defines the second-moment shift, a dependence arises. We will augment the proof in §4 with a contraction-mapping argument: the subspace estimator is Lipschitz continuous in the second-moment matrix, and a standard fixed-point result shows that the quadratic excess-loss bound continues to hold with an additive term that vanishes at rate 1/√n for n OOD samples. This supplies the missing self-referential control without changing the leading-order result. revision: yes

  3. Referee: [Table 2, §5.3] Table 2 and §5.3 (Empirical Faithfulness): The causal faithfulness metric relies on intervention-based evaluation, yet the paper does not report whether the same intervention sets are used for both ID and OOD regimes or whether the realignment affects the support of the selected features. If the support changes, the cross-regime comparison may confound geometric improvement with changes in the underlying causal variables.

    Authors: We confirm that the intervention sets are held fixed across ID and OOD regimes so that the same causal variables are tested. Because the realignment operator is an isometry restricted to the active subspace, it leaves the ordering and support of the top-k activated atoms unchanged; the identical feature indices are therefore selected and intervened upon in both regimes. We will add an explicit statement of this protocol to §5.3 and to the caption of Table 2, together with a brief verification that feature support is invariant under the reported realignment. revision: yes
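The second response above appeals to a finite-sample term that vanishes at rate 1/√n. The simulation below illustrates only the underlying statistical fact that the sample second-moment matrix converges to its population value at roughly that rate; it is not the contraction-mapping argument the simulated authors promise. The Gaussian data, the dimension `d=8`, and the function names are assumptions chosen for the example.

```python
import numpy as np

def second_moment(acts: np.ndarray) -> np.ndarray:
    """Empirical second-moment matrix of activations with shape (n, d)."""
    return acts.T @ acts / acts.shape[0]

def estimation_error(n: int, d: int = 8, trials: int = 50, seed: int = 0) -> float:
    """Mean Frobenius error of the sample second moment for standard
    Gaussian data, whose population second moment is the identity."""
    rng = np.random.default_rng(seed)
    errs = []
    for _ in range(trials):
        X = rng.normal(size=(n, d))
        errs.append(np.linalg.norm(second_moment(X) - np.eye(d), ord="fro"))
    return float(np.mean(errs))

def loglog_slope(n_small: int, n_large: int) -> float:
    """Slope of log(error) against log(n); about -0.5 for a 1/sqrt(n) rate."""
    e_small, e_large = estimation_error(n_small), estimation_error(n_large)
    return float(np.log(e_large / e_small) / np.log(n_large / n_small))
```

Comparing, say, `n=400` with `n=6400` should give a slope close to -0.5, meaning sixteen times more OOD samples cut the estimation error by about a factor of four.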

Circularity Check

0 steps flagged

No circularity: bound derived from independent geometric and distributional quantities

Full rationale

The paper defines the faithfulness gap as the geometric distance between the ID dictionary and the OOD-active subspace, then proves an excess-loss bound quadratic in the second-moment shift. The second-moment shift is an external, observable property of the distribution change rather than a fitted parameter or quantity defined in terms of the bound itself. The realignment step is presented as a constructive method using unlabeled OOD activations, and the preservation of feature structure is an explicit modeling assumption rather than a derived identity. No equation reduces the claimed improvement to a tautology or to a self-citation chain; the derivation therefore remains self-contained against external benchmarks.
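The rationale treats the second-moment shift as an observable quantity. A minimal sketch of how it could be measured from two batches of activations follows. The normalization by the ID second-moment norm is an assumption: one of the figure captions refers to a "normalized" shift, but this page does not say how it is normalized, and the function names are illustrative.

```python
import numpy as np

def second_moment(acts: np.ndarray) -> np.ndarray:
    """Empirical second-moment matrix of activations with shape (n, d)."""
    return acts.T @ acts / acts.shape[0]

def second_moment_shift(id_acts: np.ndarray, ood_acts: np.ndarray,
                        normalize: bool = True) -> float:
    """Frobenius norm of M_OOD - M_ID, optionally divided by ||M_ID||_F
    (the normalization is an assumed convention)."""
    M_id, M_ood = second_moment(id_acts), second_moment(ood_acts)
    shift = np.linalg.norm(M_ood - M_id, ord="fro")
    if normalize:
        shift = shift / np.linalg.norm(M_id, ord="fro")
    return float(shift)
```

Neither input needs labels, and the computation does not involve the explainer, which is why the quantity can be treated as external to the bound.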

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the geometric view of subspace rotation under shift and the claim that dictionary realignment preserves feature semantics. No explicit free parameters are named in the abstract. The method introduces GAE as a new procedure rather than a new physical entity.

axioms (2)
  • domain assumption Distribution shift rotates the subspace that the model actively uses.
    Stated directly in the abstract as the source of misalignment.
  • domain assumption Realignment can be performed while preserving original feature structure.
    Required for the claim that GAE improves faithfulness without retraining.

pith-pipeline@v0.9.0 · 5728 in / 1418 out tokens · 25930 ms · 2026-05-22T08:11:29.851045+00:00 · methodology



Reference graph

Works this paper leans on

49 extracted references · 49 canonical work pages · 8 internal anchors

  1. [1]

    Causal abstraction: A theoretical foundation for mechanistic interpretability

    Atticus Geiger, Duligur Ibeling, Amir Zur, Maheep Chaudhary, Sonakshi Chauhan, Jing Huang, Aryaman Arora, Zhengxuan Wu, Noah Goodman, Christopher Potts, et al. Causal abstraction: A theoretical foundation for mechanistic interpretability. Journal of Machine Learning Research, 26(83):1–64, 2025

  2. [2]

    In-context Learning and Induction Heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. arXiv preprint arXiv:2209.11895, 2022

  3. [3]

    Open Problems in Mechanistic Interpretability

    Lee Sharkey, Bilal Chughtai, Joshua Batson, Jack Lindsey, Jeff Wu, Lucius Bushnaq, Nicholas Goldowsky-Dill, Stefan Heimersheim, Alejandro Ortega, Joseph Bloom, et al. Open problems in mechanistic interpretability. arXiv preprint arXiv:2501.16496, 2025

  4. [4]

    Towards monosemanticity: Decomposing language models with dictionary learning

    Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Con- erly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and ...

  5. [5]

    Transcoders find interpretable llm feature circuits

    Jacob Dunefsky, Philippe Chlenski, and Neel Nanda. Transcoders find interpretable llm feature circuits. Advances in Neural Information Processing Systems, 37:24375–24410, 2024

  6. [6]

    Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet

    Adly Templeton. Scaling monosemanticity: Extracting interpretable features from claude 3 sonnet. Anthropic, 2024

  7. [7]

    Scaling and evaluating sparse autoencoders

    Leo Gao, Tom Dupré la Tour, Henk Tillman, Gabriel Goh, Rajan Troll, Alec Radford, Ilya Sutskever, Jan Leike, and Jeffrey Wu. Scaling and evaluating sparse autoencoders. arXiv preprint arXiv:2406.04093, 2024

  8. [8]

    Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    Samuel Marks, Can Rager, Eric J Michaud, Yonatan Belinkov, David Bau, and Aaron Mueller. Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647, 2024

  9. [9]

    Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685, 2020

    Alon Jacovi and Yoav Goldberg. Towards faithfully interpretable nlp systems: How should we define and evaluate faithfulness? arXiv preprint arXiv:2004.03685, 2020

  10. [10]

    Negative results for sparse autoencoders on downstream tasks and deprioritising sae research

    Google DeepMind Safety Research. Negative results for sparse autoencoders on downstream tasks and deprioritising sae research. DeepMind Safety Research Blog, 2025. Blog post

  11. [11]

    Sanity checks for saliency maps

    Julius Adebayo, Justin Gilmer, Michael Muelly, Ian Goodfellow, Moritz Hardt, and Been Kim. Sanity checks for saliency maps. Advances in neural information processing systems, 31, 2018

  12. [12]

    Interpretation of neural networks is fragile

    Amirata Ghorbani, Abubakar Abid, and James Zou. Interpretation of neural networks is fragile. In Proceedings of the AAAI conference on artificial intelligence, volume 33, pages 3681–3688, 2019

  13. [13]

    On the robustness of removal-based feature attributions

    Chris Lin, Ian Covert, and Su-In Lee. On the robustness of removal-based feature attributions. Advances in Neural Information Processing Systems, 36:79613–79666, 2023

  14. [14]

    On the consistency and robustness of saliency explanations for time series classification

    Chiara Balestra, Bin Li, and Emmanuel Müller. On the consistency and robustness of saliency explanations for time series classification. arXiv preprint arXiv:2309.01457, 2023

  15. [15]

    Faithfulsae: Towards capturing faithful features with sparse autoencoders without external datasets dependency

    Seonglae Cho, Harryn Oh, Donghyun Lee, Luis Rodrigues Vieira, Andrew Bermingham, and Ziad El Sayed. Faithfulsae: Towards capturing faithful features with sparse autoencoders without external datasets dependency. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume4: Student Research Workshop), pages 297–314, 2025

  16. [16]

    Tilted empirical risk minimization

    Tian Li, Ahmad Beirami, Maziar Sanjabi, and Virginia Smith. Tilted empirical risk minimization. arXiv preprint arXiv:2007.01162, 2020. 10

  17. [17]

    Decoding dark matter: Specialized sparse autoencoders for interpreting rare concepts in foundation models

    Aashiq Muhamed, Mona Diab, and Virginia Smith. Decoding dark matter: Specialized sparse autoencoders for interpreting rare concepts in foundation models. InFindings of the Association for Computational Linguistics: NAACL 2025, pages 1604–1635, 2025

  18. [18]

    Teach old saes new domain tricks with boosting

    Nikita Koriagin, Yaroslav Aksenov, Daniil Laptev, Gleb Gerasimov, Nikita Balagansky, and Daniil Gavrilov. Teach old saes new domain tricks with boosting. arXiv preprint arXiv:2507.12990, 2025

  19. [19]

    Mechanistic Interpretability for AI Safety -- A Review

    Leonard Bereska and Efstratios Gavves. Mechanistic interpretability for ai safety–a review. arXiv preprint arXiv:2404.14082, 2024

  20. [20]

    A simple unified framework for detecting out-of-distribution samples and adversarial attacks

    Kimin Lee, Kibok Lee, Honglak Lee, and Jinwoo Shin. A simple unified framework for detecting out-of-distribution samples and adversarial attacks. Advances in neural information processing systems, 31, 2018

  21. [21]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoen- coders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600, 2023

  22. [22]

    Transcoders beat sparse autoencoders for interpretability

    Gonçalo Paulo, Stepan Shabalin, and Nora Belrose. Transcoders beat sparse autoencoders for interpretability. arXiv preprint arXiv:2501.18823, 2025

  23. [23]

    Normalized aopc: Fixing misleading faithfulness metrics for feature attributions explainability

    Joakim Edin, Andreas Geert Motzfeldt, Casper L Christensen, Tuukka Ruotsalo, Lars Maaløe, and Maria Maistro. Normalized aopc: Fixing misleading faithfulness metrics for feature attributions explainability. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (V olume1: Long Papers), pages 1715–1730, 2025

  24. [24]

    Eraser: A benchmark to evaluate rationalized nlp models

    Jay DeYoung, Sarthak Jain, Nazneen Fatema Rajani, Eric Lehman, Caiming Xiong, Richard Socher, and Byron C Wallace. Eraser: A benchmark to evaluate rationalized nlp models. In Proceedings of the 58th annual meeting of the association for computational linguistics, pages 4443–4458, 2020

  25. [25]

    Causal scrubbing: A method for rigorously testing interpretability hypotheses

    Lawrence Chan, Adria Garriga-Alonso, Nicholas Goldowsky-Dill, Ryan Greenblatt, Jenny Nitishinskaya, Ansh Radhakrishnan, Buck Shlegeris, and Nate Thomas. Causal scrubbing: A method for rigorously testing interpretability hypotheses. In AI Alignment Forum, volume 2, 2022

  26. [26]

    Locating and editing factual associations in gpt

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in gpt. Advances in neural information processing systems, 35:17359–17372, 2022

  27. [27]

    Similarity of neural network representations revisited

    Simon Kornblith, Mohammad Norouzi, Honglak Lee, and Geoffrey Hinton. Similarity of neural network representations revisited. In International conference on machine learning, pages 3519–3529. PMLR, 2019

  28. [28]

    Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability

    Maithra Raghu, Justin Gilmer, Jason Yosinski, and Jascha Sohl-Dickstein. Svcca: Singular vector canonical correlation analysis for deep learning dynamics and interpretability. Advances in neural information processing systems, 30, 2017

  29. [29]

    Generalized shape metrics on neural representations

    Alex H Williams, Erin Kunz, Simon Kornblith, and Scott Linderman. Generalized shape metrics on neural representations. Advances in neural information processing systems, 34:4738–4750, 2021

  30. [30]

    Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2

    Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma scope: Open sparse autoencoders everywhere all at once on gemma 2. In Proceedings of the 7th BlackboxNLP Workshop: Analyzing and Interpreting Neural Networks for NLP, pages 278– 300, 2024

  31. [31]

    Predicting trends in the qual- ity of state-of-the-art neural networks without access to training or testing data

    Charles H Martin, Tongsu Peng, and Michael W Mahoney. Predicting trends in the qual- ity of state-of-the-art neural networks without access to training or testing data. Nature Communications, 12(1):4122, 2021. 11

  32. [32]

    Intrinsic dimension of data representations in deep neural networks

    Alessio Ansuini, Alessandro Laio, Jakob H Macke, and Davide Zoccolan. Intrinsic dimension of data representations in deep neural networks. Advances in Neural Information Processing Systems, 32, 2019

  33. [33]

    Intrinsic dimensionality explains the effectiveness of language model fine-tuning

    Armen Aghajanyan, Sonal Gupta, and Luke Zettlemoyer. Intrinsic dimensionality explains the effectiveness of language model fine-tuning. In Proceedings of the 59th annual meeting of the association for computational linguistics and the 11th international joint conference on natural language processing (volume 1: long papers), pages 7319–7328, 2021

  34. [34]

    Gemma scope 2: Technical paper

    Callum McDougall, Arthur Conmy, János Kramár, Tom Lieberum, Senthooran Rajamanoharan, and Neel Nanda. Gemma scope 2: Technical paper. Technical report, Google DeepMind, 2025

  35. [35]

    The rotation of eigenvectors by a perturbation

    Chandler Davis and William Morton Kahan. The rotation of eigenvectors by a perturbation. III. SIAM Journal on Numerical Analysis, 7(1):1–46, 1970

  36. [36]

    Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

    Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders. arXiv preprint arXiv:2407.14435, 2024

  37. [37]

    A generalized solution of the orthogonal procrustes problem

    Peter H Schönemann. A generalized solution of the orthogonal procrustes problem. Psychometrika, 31(1):1–10, 1966

  38. [38]

    Language models are unsupervised multitask learners

    Alec Radford, Jeffrey Wu, Rewon Child, David Luan, Dario Amodei, Ilya Sutskever, et al. Language models are unsupervised multitask learners. OpenAI blog, 1(8):9, 2019

  39. [39]

    Pythia: A suite for analyzing large language models across training and scaling

    Stella Biderman, Hailey Schoelkopf, Quentin Gregory Anthony, Herbie Bradley, Kyle O’Brien, Eric Hallahan, Mohammad Aflah Khan, Shivanshu Purohit, USVSN Sai Prashanth, Edward Raff, et al. Pythia: A suite for analyzing large language models across training and scaling. In International Conference on Machine Learning, pages 2397–2430. PMLR, 2023

  40. [40]

    FineWeb: decanting the web for the finest text data at scale

    Guilherme Penedo, Hynek Kydlíček, Loubna Ben Allal, and Thomas Wolf. FineWeb: decanting the web for the finest text data at scale. HuggingFace. Accessed: Jul 12, 2024

  41. [41]

    EDGAR-CORPUS: Billions of tokens make the world go round

    Lefteris Loukas, Manos Fergadiotis, Ion Androutsopoulos, and Prodromos Malakasiotis. EDGAR-CORPUS: Billions of tokens make the world go round. In Proceedings of the Third Workshop on Economics and Natural Language Processing, pages 13–18, 2021

  42. [42]

    HaluEval: A large-scale hallucination evaluation benchmark for large language models

    Junyi Li, Xiaoxue Cheng, Wayne Xin Zhao, Jian-Yun Nie, and Ji-Rong Wen. HaluEval: A large-scale hallucination evaluation benchmark for large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 6449–6464, 2023

  43. [43]

    SAEs (usually) transfer between base and chat models

    Connor Kissane, Robert Krzyzanowski, Arthur Conmy, and Neel Nanda. SAEs (usually) transfer between base and chat models. Alignment Forum. URL https://www.alignmentforum.org/posts/fmwk6qxrpW8d4jvbd/saes-usually-transfer-between-base-and-chat-models

  45. [45]

    The geometry of algorithms with orthogonality constraints

    Alan Edelman, Tomás A Arias, and Steven T Smith. The geometry of algorithms with orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2):303–353, 1998

  46. [46]

    A useful variant of the Davis–Kahan theorem for statisticians

    Yi Yu, Tengyao Wang, and Richard J Samworth. A useful variant of the Davis–Kahan theorem for statisticians. Biometrika, 102(2):315–323, 2015

  47. [47]

    OpenWebText corpus

    Aaron Gokaslan and Vanya Cohen. OpenWebText corpus. http://Skylion007.github.io/OpenWebTextCorpus, 2019

  48. [48]

    The Pile: An 800GB Dataset of Diverse Text for Language Modeling

    Leo Gao, Stella Biderman, Sid Black, Laurence Golding, Travis Hoppe, Charles Foster, Jason Phang, Horace He, Anish Thite, Noa Nabeshima, et al. The Pile: An 800GB dataset of diverse text for language modeling. arXiv preprint arXiv:2101.00027, 2020

  49. [49]

    Fine-tuning can distort pretrained features and underperform out-of-distribution, 2022

    Ananya Kumar, Aditi Raghunathan, Robbie Jones, Tengyu Ma, and Percy Liang. Fine-tuning can distort pretrained features and underperform out-of-distribution. arXiv preprint arXiv:2202.10054, 2022

A Proofs and Derivations

A.1 Proof of Proposition 1

Setup. Write the second-moment shift as $E = M_{\mathrm{OOD}} - M_{\mathrm{ID}}$, so that $M_{\mathrm{OOD}} = M_{\mathrm{ID}} + E$. The projectors $\Pi_{\mathrm{ID}}$ and $\Pi_{\mathrm{OOD}}$...