Pith · machine review for the scientific record

arxiv: 2605.08740 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Recognition: 2 theorem links


Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 03:06 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords causal dimensionality · sparse autoencoders · transformer representations · attribution patching · effective rank · model scaling · layer structure · Jacobian outer product

The pith

Transformer layers have an intrinsic causal dimensionality that saturates well below their full feature capacity and stays fixed under scaling and depth changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper defines causal dimensionality kappa as the effective rank of the expected Jacobian outer product at each layer and estimates it by sweeping sparse autoencoder widths while measuring causal effects via attribution patching. It finds that representational features expand 15.6 times as SAE width grows but causal capacity expands only 4.35 times, creating a consistent gap called the representational-causal wedge. The same kappa value appears in models that differ by a factor of 3.46 in parameters, and the value remains steady from early to late layers even as the raw attribution threshold falls sharply. These patterns indicate that kappa is a stable, model-intrinsic quantity rather than an artifact of size or position.
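The central quantity can be sketched directly: the effective rank (Roy & Vetterli, 2007) of a layer's expected Jacobian outer product is the exponentiated Shannon entropy of its normalised eigenvalue spectrum. A minimal illustration under invented assumptions — the dimensions and the toy Jacobian confined to a k-dimensional subspace are ours, not the paper's:

```python
import numpy as np

def effective_rank(G):
    """exp of the Shannon entropy of the normalised eigenvalue
    spectrum of a PSD matrix (Roy & Vetterli, 2007)."""
    lam = np.clip(np.linalg.eigvalsh(G), 0.0, None)  # guard tiny negatives
    p = lam / lam.sum()
    p = p[p > 0]                                     # 0 * log 0 := 0
    return float(np.exp(-(p * np.log(p)).sum()))

rng = np.random.default_rng(0)
d, k = 512, 50                                       # ambient dim, causal dim
basis, _ = np.linalg.qr(rng.normal(size=(d, k)))     # k-dim causal subspace

# Expected Jacobian outer product E[J^T J], averaged over sampled inputs,
# with every Jacobian confined to the causal subspace.
G = np.zeros((d, d))
for _ in range(200):
    J = rng.normal(size=(k, k)) @ basis.T            # rank-k Jacobian (k x d)
    G += J.T @ J
G /= 200

kappa = effective_rank(G)
# kappa lands near k, well below the ambient dimension d
assert k / 2 < kappa < k + 1
```

The point of the toy: effective rank reads off the causal subspace dimension even when the ambient dimension is an order of magnitude larger, which is the shape of the paper's kappa-versus-d_model claim.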

Core claim

Causal dimensionality kappa is recovered at approximately 1,990 for Gemma-2-2B layer 12, with a participation-ratio lower bound of 280 and kappa divided by model dimension equal to 0.86. The quantity is invariant to model scaling, returning identical causal neuron counts at matched SAE widths for Gemma-2-9B and Gemma-2-2B, and it is constant across eight depths while absolute attribution thresholds drop by a factor of twenty.
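The "saturating fit" behind kappa-hat ≈ 1,990 is not spelled out in this summary, so the sketch below assumes one plausible family, N_causal(m) = κ·(1 − exp(−m/τ)), and shows how the asymptote would be recovered from a seven-width sweep by least squares. The data here are synthetic and κ, τ, and the noise level are invented for illustration:

```python
import numpy as np

widths = np.array([16_384, 32_768, 65_536, 131_072,
                   262_144, 524_288, 1_048_576], dtype=float)

# Synthetic sweep from the assumed saturating-exponential law, with 1% noise.
kappa_true, tau_true = 1990.0, 60_000.0
rng = np.random.default_rng(1)
counts = kappa_true * (1 - np.exp(-widths / tau_true))
counts *= rng.normal(1.0, 0.01, widths.size)

def sse(kappa, tau):
    """Sum of squared residuals of the saturating model against the sweep."""
    return float(((kappa * (1 - np.exp(-widths / tau)) - counts) ** 2).sum())

# Coarse grid search standing in for a nonlinear optimiser.
kappas = np.linspace(1500, 2500, 201)
taus = np.linspace(20_000, 120_000, 201)
errs = np.array([[sse(k, t) for t in taus] for k in kappas])
k_hat = kappas[np.unravel_index(errs.argmin(), errs.shape)[0]]
assert abs(k_hat - kappa_true) < 100   # asymptote recovered from the sweep
```

As the referee report below notes, the recovered asymptote is only as meaningful as the chosen functional family; swapping the assumed form can move the estimate, which is why goodness-of-fit diagnostics matter.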

What carries the argument

Causal dimensionality kappa, defined as the effective rank of the expected Jacobian outer product at layer L, recovered by sweeping SAE widths and applying attribution patching to isolate causal influence on model outputs.
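Attribution patching replaces a second forward pass with a first-order Taylor estimate: the effect of swapping an activation a_clean → a_corrupt is approximated by the gradient at the clean run dotted with the activation difference. A toy sketch — the scalar readout function and its dimensions are invented for illustration, not the paper's setup:

```python
import numpy as np

rng = np.random.default_rng(2)
W = rng.normal(size=16)

def logit(a):
    """Toy scalar readout standing in for a model output."""
    return float(np.tanh(a) @ W)

def grad_logit(a):
    """Analytic gradient d logit / d a."""
    return (1 - np.tanh(a) ** 2) * W

a_clean = rng.normal(size=16)
a_corrupt = a_clean + 1e-3 * rng.normal(size=16)     # nearby patch

# Ground truth: actually rerun with the corrupted activation patched in.
true_effect = logit(a_clean) - logit(a_corrupt)

# AtP: one backward pass at the clean point, no second forward pass.
atp_effect = grad_logit(a_clean) @ (a_clean - a_corrupt)

assert abs(true_effect - atp_effect) < 1e-3          # first-order error only
```

The design trade-off is the same one the paper leans on: AtP is cheap enough to score every SAE feature at every width, at the price of a linearisation error that grows with patch size.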

If this is right

  • Full causal recovery requires SAE widths substantially larger than the recovered kappa value.
  • Increasing total model parameters does not raise the causal dimensionality of any layer.
  • Deeper layers maintain the same causal rank even though their individual attribution scores become smaller.
  • The gap between representational and causal capacity persists across architecture controls and threshold choices.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Methods that aim to edit or interpret model behavior may only need to track a few hundred causal directions per layer rather than the full residual stream.
  • The constancy of kappa across depth suggests early layers already compress causal dependencies to a stable low-dimensional form.
  • Testing whether kappa changes with task complexity or dataset statistics would show whether it is purely architectural or partly data-dependent.

Load-bearing premise

The combination of SAE width sweeps and attribution patching recovers the true effective rank of the expected Jacobian outer product without large bias from training dynamics, patching interference, or the fixed 2 percent calibration cutoff.

What would settle it

A change in the shape of the attribution patching score distribution or a different recovered N_causal when the same SAE widths are applied to models larger than those tested would falsify the scaling-invariance claim.

Figures

Figures reproduced from arXiv: 2605.08740 by Dawar Jyoti Deka, Nilesh Sarkar.

Figure 1
Figure 1: Width scaling on Gemma-2-2B layer 12. The non-monotonicity at m = 65,536 is a feature-splitting artefact (§3.5). The wedge ratio Ncausal(1M)/Ncausal(16k) has bootstrap 95% CI [3.86, 4.91], excluding the Nrepr ratio of 15.6 by an order of magnitude.
Figure 2
Figure 2: Confidence intervals on the headline numbers. (a) The wedge-ratio bootstrap CI [3.86, 4.91] (B = 10,000, resampling features with replacement) excludes the Nrepr ratio of 15.6, ruling out the wedge as a sampling artefact. (b) The Wald CI on κ̂ is wide because the saturating fit has only 4 post-minimum points (a hardware ceiling at GemmaScope's m = 10⁶), but the non-parametric bootstrap CI [545, 5130] is …
Figure 3
Figure 3: Scale invariance and layer profile on Gemma-2-2B. Any cross-layer study using a fixed absolute threshold will over-count causally active features in late layers. (a) Architecture invariance. Sub-linear Ncausal scaling holds across TopK, JumpReLU, and standard ReLU SAEs. (b) Four-cell ablation. Random encoder inflates Ncausal to 9.27×: the trained encoder acts as a sparsity filter, not a feature selector …
Figure 4
Figure 4: Architecture invariance and encoder/decoder ablation on Gemma-2-2B layer 12. The wedge is robust across SAE families; the four-cell ablation isolates the encoder's suppression role. 6 Controls and Negative Results 6.1 Threshold Robustness Sweeping ε from 0.1× to 10× the calibrated value yields width-to-Ncausal ratios of {21.42, 12.26, 4.35, 3.10, 4.43}× at {0.1, 0.3, 1.0, 3.0, 10.0}× respectively (exp4_ep…
Figure 5
Figure 5: Synthetic ground-truth recovery and geometric privilege test. Both results pin down what κ measures: not cosine alignment to causal directions, but encoder activation patterns routed through the decoder. 6.3 Encoder vs Decoder: A Four-Cell Decomposition Figure 4b constructs three controls: RANDOM_DEC replaces the trained decoder with a random orthonormal matrix; SHUFFLED_DEC permutes decoder rows (preservi…
Figure 6
Figure 6: AtP validation on Gemma-2-2B layer 12. Attribution patching provides a reliable ranking in the high-score regime that governs Ncausal measurement.
Figure 7
Figure 7: κ̂/d_model is robust across fit families and bootstrap. Saturating-exponential post-minimum fit, saturating-exponential with explicit dip term (all 7 widths), and non-parametric bootstrap median all yield κ̂/d_model ∈ [0.74, 0.86]. None reach the ceiling κ̂/d_model = 1, consistent with κ being below but close to the ambient dimension on generic Pile-10k.
Figure 8
Figure 8: Per-width Ncausal with bootstrap 95% CI (B = 10,000, resampling features with replacement). The post-minimum saturating-exponential fit (purple dashed) recovers the data within CIs at all widths past m = 65,536.
Figure 9
Figure 9: Threshold robustness sweep on Gemma-2-2B layer 12. The wedge is robust to threshold choice. D The L0 Confound in the Canonical Sweep The canonical GemmaScope sweep yields a non-monotone Ncausal with a dip at m = 65,536 and apparent super-linear growth thereafter ([328, 407, 731, 4516, 13960, 28801]). The largest value, 28,801, exceeds d = 2304 and would, taken at face value, contradict the bound of Proposi…
Figure 10
Figure 10: Canonical-L0 sweep showing the L0 confound. The dip at m = 65,536 is inflated relative to the matched-L0 result (…
Figure 11
Figure 11: Full architecture invariance results. Sub-linear Ncausal scaling holds across model families (Pythia-70m, Gemma-2-2B) and SAE types. circuit on the correct distribution and only 0.04 on the wrong distribution. Cuts at p92 and below have ε = 0 and trivially select all 32,768 features; we do not interpret these as meaningful thresholds. G Sparsity Dependence: Full Results We probe how per-feature firing proba…
Figure 12
Figure 12: Task-specific validation vs the Marks et al. (2025) SVA circuit on Pythia-70m layer 4. The contrast directly confirms that generic-text Ncausal is the union of task-specific circuits. (a) Spearman ρ between p_act and causal membership by architecture. Gated, p_anneal, and standard SAEs anti-correlate; TopK is the lone positive exception. (b) Global scatter of p_act vs causal membership across 20 SAEs. Globa…
Figure 13
Figure 13: Sparsity dependence. Causal features anti-correlate with p_act in 3/4 architectures. The TopK exception is structural.
Figure 14
Figure 14: Full geometric privilege analysis on Gemma-2-2B layer 12. Inclusion gap between causal and control decoder directions is ≈ 0 at all SAE widths.
Figure 15
Figure 15: Auto-interpretability coherence, Gemma-2-2B layer 12, width 16,384. Left: coherence distributions (1–5 scale) for causal (red) and control (green) features; gap = −0.15. Right: coherence vs AtP score; no strong positive relationship, consistent with the union-of-circuits interpretation. J Cross-Family Wedge Replication To rule out that the representational–causal wedge is an artefact of the Gemma-2 fami…
Figure 16
Figure 16: Cross-family wedge: four-panel summary. (a) The wedge holds across three independently trained model families with inert fraction > 95% in all cases. (b) The Pythia SAE has only 42% utilisation (Nrepr/m), against ∼94% for LLaMA and ∼97% for Gemma: the EleutherAI sae-pythia-410m-65k release has 58% dead features, which is the cause of Pythia's lower observed inert fraction. (c) Ncausal scales with SAE wi…
Figure 17
Figure 17: Cross-family AtP distribution shape and threshold sensitivity. The structural form of the wedge (heavy-tailed AtP, sub-linear Ncausal growth) is preserved across model families and hook positions; absolute magnitudes vary as expected with dmodel, SAE width, and hook position.
read the original abstract

Sparse autoencoders (SAEs) decompose transformer residual streams into interpretable feature dictionaries, yet the relationship between SAE width and causal influence on model output has not been systematically characterised. We introduce causal dimensionality kappa(L, M, T), defined as the effective rank of the expected Jacobian outer product at layer L, and show it can be estimated via the SAE width sweep paired with attribution patching. Across seven SAE widths from 16,384 to 1,048,576 features on Gemma-2-2B layer 12, representational capacity grows 15.6x while causal capacity grows only 4.35x: a robust separation we term the representational-causal wedge. A saturating fit yields kappa-hat approximately 1,990 with kappa-hat / d_model = 0.86 and participation-ratio lower bound kappa_PR approximately 280. Crucially, kappa is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions). Across eight network depths kappa is constant while the absolute attribution threshold drops 20x from layer 1 to layer 23. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, and a four-cell encoder/decoder ablation) pin down what kappa measures and what it does not. Our findings establish kappa as a measurable, model-intrinsic property of transformer layers: sub-linearly recoverable by SAE width, invariant to model scaling, and structured across network depth.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces causal dimensionality kappa(L, M, T) as the effective rank of the expected Jacobian outer product at transformer layer L, estimated via SAE width sweeps paired with attribution patching. It reports a representational-causal wedge in which SAE width increases representational capacity 15.6x but causal capacity only 4.35x on Gemma-2-2B layer 12; kappa is invariant to model scaling (identical N_causal=328 for Gemma-2-9B and 2B at matched widths despite 3.46x parameter difference); and kappa remains constant across eight depths while absolute attribution thresholds drop 20x. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, encoder/decoder ablation) are used to characterize what kappa measures.

Significance. If the width-sweep plus attribution-patching procedure recovers an unbiased estimate of effective rank, the work would establish a new intrinsic, model-scale-invariant property of transformer layers that separates representational from causal capacity and is structured by depth. The explicit listing of five controls, including synthetic ground-truth recovery, and the direct reporting of the wedge constitute strengths that would make the result useful for interpretability and scaling research.

major comments (3)
  1. [Abstract] Abstract: the reported invariance to model scaling rests on identical N_causal=328 at the same SAE width for Gemma-2-9B and Gemma-2-2B, yet N_causal is explicitly forced to 2% of SAE width by calibration; while the text correctly identifies shape invariance of the AtP score distribution as the substantive claim, no quantitative comparison (e.g., distribution overlap or statistical test) of the AtP scores across the two models is provided to demonstrate that the shapes are in fact invariant rather than an artifact of the shared calibration rule.
  2. [Abstract] Abstract: the saturating fit that yields kappa-hat approximately 1,990 (with kappa-hat / d_model = 0.86) is applied post-hoc to the seven-width sweep data; without the explicit functional form, fitted parameters, or goodness-of-fit diagnostics for this saturating model, it is impossible to assess whether the reported value reflects a genuine property of the Jacobian or is shaped by the choice of saturating function and width range (16,384 to 1,048,576).
  3. [Abstract] Abstract: the central methodological assumption—that the SAE width sweep combined with attribution patching recovers the effective rank of the expected Jacobian outer product without substantial bias from SAE training dynamics, reconstruction error, or patching interference—is load-bearing for all scaling and depth claims. Although five controls are listed, including synthetic ground-truth recovery, it is not shown that these controls were executed under the precise experimental conditions (seq=512, multi-layer residual streams) used for the main Gemma experiments, leaving open the possibility of differential bias across model scales and depths.
minor comments (2)
  1. The notation kappa(L, M, T) is introduced without an explicit equation or definition of its dependence on sequence length T; a short formal definition would improve clarity.
  2. The participation-ratio lower bound kappa_PR approximately 280 is stated but its exact computation and relation to the main kappa-hat estimate are not expanded; a brief derivation or reference would help readers connect the two quantities.
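The derivation the second minor comment asks for is short: the participation ratio PR = (Σλ)²/Σλ² is the exponential of the order-2 Rényi entropy of the normalised spectrum, while effective rank is the exponential of its Shannon entropy; since Rényi entropy is non-increasing in its order, PR never exceeds effective rank, so a gap like kappa_PR ≈ 280 versus kappa-hat ≈ 1,990 is exactly what a heavy-tailed spectrum produces. A sketch on an assumed power-law spectrum (the exponent is invented for illustration):

```python
import numpy as np

def participation_ratio(lam):
    """PR = (sum λ)^2 / sum λ^2 — exp of the order-2 Rényi entropy of p."""
    lam = np.asarray(lam, dtype=float)
    return float(lam.sum() ** 2 / (lam ** 2).sum())

def effective_rank(lam):
    """exp of the Shannon entropy of the normalised spectrum p."""
    p = np.asarray(lam, dtype=float)
    p = p / p.sum()
    p = p[p > 0]
    return float(np.exp(-(p * np.log(p)).sum()))

# Heavy-tailed spectrum: the order-2 Rényi entropy never exceeds the
# Shannon entropy, so PR lower-bounds effective rank — same ordering as
# the paper's kappa_PR vs kappa-hat.
spectrum = np.arange(1, 2001, dtype=float) ** -0.5
pr = participation_ratio(spectrum)
er = effective_rank(spectrum)
assert pr <= er <= spectrum.size
```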

Simulated Author's Rebuttal

3 responses · 0 unresolved

Thank you for the thoughtful review. We have carefully considered each major comment and provide point-by-point responses below. Where the comments highlight areas for improvement in presentation or additional evidence, we commit to revisions in the next version of the manuscript.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the reported invariance to model scaling rests on identical N_causal=328 at the same SAE width for Gemma-2-9B and Gemma-2-2B, yet N_causal is explicitly forced to 2% of SAE width by calibration; while the text correctly identifies shape invariance of the AtP score distribution as the substantive claim, no quantitative comparison (e.g., distribution overlap or statistical test) of the AtP scores across the two models is provided to demonstrate that the shapes are in fact invariant rather than an artifact of the shared calibration rule.

    Authors: We agree that a quantitative comparison would strengthen the claim of shape invariance. In the revised manuscript, we will add a supplementary figure comparing the AtP score distributions for Gemma-2-9B and Gemma-2-2B at matched SAE widths under seq=512 conditions, including metrics such as distribution overlap (e.g., Jensen-Shannon divergence) or a statistical test to confirm similarity independent of the calibration rule. revision: yes

  2. Referee: [Abstract] Abstract: the saturating fit that yields kappa-hat approximately 1,990 (with kappa-hat / d_model = 0.86) is applied post-hoc to the seven-width sweep data; without the explicit functional form, fitted parameters, or goodness-of-fit diagnostics for this saturating model, it is impossible to assess whether the reported value reflects a genuine property of the Jacobian or is shaped by the choice of saturating function and width range (16,384 to 1,048,576).

    Authors: We acknowledge the need for greater transparency in the fitting procedure. The revised manuscript will include the explicit functional form of the saturating model, the fitted parameters, and goodness-of-fit diagnostics such as R² and residual analysis to allow readers to assess whether the kappa estimate reflects properties of the Jacobian rather than the choice of function or width range. revision: yes

  3. Referee: [Abstract] Abstract: the central methodological assumption—that the SAE width sweep combined with attribution patching recovers the effective rank of the expected Jacobian outer product without substantial bias from SAE training dynamics, reconstruction error, or patching interference—is load-bearing for all scaling and depth claims. Although five controls are listed, including synthetic ground-truth recovery, it is not shown that these controls were executed under the precise experimental conditions (seq=512, multi-layer residual streams) used for the main Gemma experiments, leaving open the possibility of differential bias across model scales and depths.

    Authors: This is a valid point regarding the scope of the controls. We will revise the manuscript to explicitly state the experimental conditions for each of the five controls. The synthetic ground-truth recovery was performed in a controlled synthetic setting to validate rank recovery, while the remaining controls (architecture invariance, threshold robustness, geometric privilege, and encoder/decoder ablation) were applied to the Gemma residual streams. We will add a clarifying table or section and discuss any limitations in generalizing across scales and depths. revision: partial

Circularity Check

1 step flagged

N_causal invariance to scaling reduces to 2% calibration rule at matched SAE widths

specific steps
  1. self-definitional [Abstract]
    "Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions)"

    N_causal is defined as the SAE width point where attribution scores hit the 2% calibration threshold. Matching SAE widths across models therefore produces identical N_causal by construction, independent of any actual invariance in the underlying expected Jacobian outer product rank. The invariance claim for kappa thus reduces to the shared calibration rule rather than emerging from the data.

full rationale

The paper's central claims rest on estimating kappa (effective rank of expected Jacobian outer product) via SAE width sweeps + attribution patching, with N_causal explicitly calibrated to 2% of SAE width. This calibration directly forces identical N_causal values when the same widths are used across models, rendering the reported invariance to 3.46x scaling a consequence of the procedure rather than an independent derivation from Jacobian properties. The paper acknowledges the forcing but pivots to AtP score shape invariance; however, the headline results (kappa invariance, representational-causal wedge, depth constancy) inherit this construction without external benchmarks confirming unbiased rank recovery. No load-bearing self-citations or ansatz smuggling detected, but the measurement chain exhibits partial circularity per the fitted/calibrated input pattern.
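The flagged step is two lines of arithmetic. If N_causal is pinned to 2% of SAE width by calibration, matched widths produce matched counts for any pair of models, whatever their underlying Jacobian ranks. A sketch (the helper name is ours, not the paper's):

```python
def n_causal(sae_width: int, model: str = "", frac: float = 0.02) -> int:
    """Causal count under the fixed 2% calibration rule; note the model
    identity never enters the formula."""
    return round(frac * sae_width)

# At the matched width used in the Gemma-2 comparison, both models are
# forced to the same count by construction: 2% of 16,384 is 328.
assert n_causal(16_384, "Gemma-2-2B") == n_causal(16_384, "Gemma-2-9B") == 328
```

This is why the substantive content of the scaling-invariance claim has to live in the shape of the AtP score distribution rather than in the count itself.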

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 1 invented entity

The central claim rests on the new definition of kappa, standard mechanistic-interpretability assumptions about SAEs and attribution patching, and two fitted quantities used to report the asymptotic value and the causal count.

free parameters (2)
  • saturating-fit asymptote = approximately 1,990
    Fitted to the width-sweep data to obtain the reported kappa-hat.
  • calibration factor for N_causal = 2% of SAE width
    Explicitly forces the causal count to 2% of SAE width.
axioms (2)
  • domain assumption Sparse autoencoders decompose transformer residual streams into interpretable feature dictionaries whose width controls representational capacity.
    Invoked to interpret the 15.6x growth in representational capacity.
  • domain assumption Attribution patching measures causal influence on model output.
    Used to estimate causal capacity and the Jacobian rank.
invented entities (1)
  • causal dimensionality kappa no independent evidence
    purpose: Quantifies the effective number of causally relevant features via Jacobian rank.
    Newly defined quantity whose value is estimated from the SAE sweep.

pith-pipeline@v0.9.0 · 5622 in / 1755 out tokens · 98716 ms · 2026-05-12T03:06:54.594982+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

  1. [1]

    Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. (2019). Intrinsic dimension of data representations in deep neural networks. In NeurIPS

  2. [2]

    Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling

    Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. arXiv:2304.01373

  3. [3]

    Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread

  4. [4]

    Chanin, D., Lloyd, C., Heimersheim, S., and Hooker, S. (2025). A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. In ICLR

  5. [5]

    Cheng, M., Diaz, M., et al. (2025). Emergence of a High-Dimensional Abstraction Phase in Language Transformers. In ICLR

  6. [6]

    Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. In NeurIPS

  7. [7]

    Dubey, A., Jauhri, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783

  8. [8]

    Elhage, N., Hume, T., Olsson, C., et al. (2022). Toy Models of Superposition. Transformer Circuits Thread

  9. [9]

    Feng, R., Zheng, K., Huang, Y., Zhao, D., Jordan, M., and Zha, Z.-J. (2022). Rank Diminishing in Deep Neural Networks. In NeurIPS

  10. [10]

    Scaling and evaluating sparse autoencoders

    Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. (2024). Scaling and evaluating sparse autoencoders. arXiv:2406.04093

  11. [11]

    Gemma Team. (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118

  12. [12]

    Karvonen, A., et al. (2025). SAEBench: A Comprehensive Benchmark for Sparse Autoencoders. In ICML

  13. [13]

Kramár, J., Lieberum, T., Shah, R., and Nanda, N. (2024). AtP*: An efficient and scalable method for localizing LLM behaviour. arXiv:2403.00745

  14. [14]

    k-Sparse Autoencoders

    Makhzani, A. and Frey, B. (2013). k-Sparse Autoencoders. arXiv:1312.5663

  15. [15]

    Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. (2025). Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs. In ICLR

  16. [16]

    Michaud, E. J., Liu, Z., Girit, U., and Tegmark, M. (2023). The Quantization Model of Neural Scaling. In NeurIPS

  17. [17]

    Nanda, N. (2022a). Attribution Patching: Activation Patching at Industrial Scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching

  18. [18]

    Nanda, N. (2022b). The Pile-10k Subset. https://huggingface.co/datasets/NeelNanda/pile-10k

  19. [19]

    Park, K., Choe, Y. J., and Veitch, V. (2024). The Linear Representation Hypothesis and the Geometry of Large Language Models. In ICML

  20. [20]

Rajamanoharan, S., Lieberum, T., Sonnerat, N., Conmy, A., Varma, V., Kramár, J., and Nanda, N. (2024). JumpReLU Sparse Autoencoders. arXiv:2407.14435

  21. [21]

    Roy, O. and Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. In EUSIPCO

  22. [22]

    Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread

  23. [23]

    Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. In ICLR