Causal Dimensionality of Transformer Representations: Measurement, Scaling, and Layer Structure
Pith reviewed 2026-05-12 03:06 UTC · model grok-4.3
The pith
Transformer layers have an intrinsic causal dimensionality that saturates well below their full feature capacity and stays fixed under scaling and depth changes.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Causal dimensionality kappa is recovered at approximately 1,990 for Gemma-2-2B layer 12, with a participation-ratio lower bound of roughly 280 and a kappa-to-model-dimension ratio of 0.86. The quantity is invariant to model scaling, returning identical causal feature counts at matched SAE widths for Gemma-2-9B and Gemma-2-2B, and it is constant across eight depths while absolute attribution thresholds drop by a factor of twenty.
What carries the argument
Causal dimensionality kappa, defined as the effective rank of the expected Jacobian outer product at layer L, recovered by sweeping SAE widths and applying attribution patching to isolate causal influence on model outputs.
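To make the measurement concrete, here is a minimal sketch of an effective-rank estimate of the expected Jacobian outer product, assuming per-token Jacobians of the layer output with respect to the residual stream can be sampled. The dimensions, sample counts, and stand-in random Jacobians below are illustrative only, not the paper's pipeline.

```python
# Sketch: effective rank of the expected Jacobian outer product E[J^T J].
# The Jacobians here are stand-in random matrices, not model Jacobians.
import numpy as np

rng = np.random.default_rng(0)
d_model = 256          # residual-stream width (illustrative, not Gemma's)
n_samples = 512        # tokens / prompts sampled

# Accumulate the expected Jacobian outer product over samples.
M = np.zeros((d_model, d_model))
for _ in range(n_samples):
    # Stand-in Jacobian with a decaying singular-value profile.
    J = rng.normal(size=(8, d_model)) @ np.diag(np.linspace(1.0, 0.01, d_model))
    M += J.T @ J
M /= n_samples

eigvals = np.clip(np.linalg.eigvalsh(M), 0.0, None)

# Participation-ratio effective rank (the kappa_PR lower bound mentioned above).
kappa_pr = eigvals.sum() ** 2 / (eigvals ** 2).sum()

# Entropy-based effective rank (Roy & Vetterli 2007), an alternative estimator.
p = eigvals / eigvals.sum()
p = p[p > 0]
kappa_erank = np.exp(-(p * np.log(p)).sum())

print(f"participation ratio ~ {kappa_pr:.1f}, effective rank ~ {kappa_erank:.1f}")
```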
If this is right
- Full causal recovery requires SAE widths substantially larger than the recovered kappa value.
- Increasing total model parameters does not raise the causal dimensionality of any layer.
- Deeper layers maintain the same causal rank even though their individual attribution scores become smaller.
- The gap between representational and causal capacity persists across architecture controls and threshold choices.
Where Pith is reading between the lines
- Methods that aim to edit or interpret model behavior may only need to track a few hundred causal directions per layer rather than the full residual stream.
- The constancy of kappa across depth suggests early layers already compress causal dependencies to a stable low-dimensional form.
- Testing whether kappa changes with task complexity or dataset statistics would show whether it is purely architectural or partly data-dependent.
Load-bearing premise
The combination of SAE width sweeps and attribution patching recovers the true effective rank of the expected Jacobian outer product without large bias from training dynamics, patching interference, or the fixed 2 percent calibration cutoff.
What would settle it
A change in the shape of the attribution patching score distribution or a different recovered N_causal when the same SAE widths are applied to models larger than those tested would falsify the scaling-invariance claim.
Original abstract
Sparse autoencoders (SAEs) decompose transformer residual streams into interpretable feature dictionaries, yet the relationship between SAE width and causal influence on model output has not been systematically characterised. We introduce causal dimensionality kappa(L, M, T), defined as the effective rank of the expected Jacobian outer product at layer L, and show it can be estimated via the SAE width sweep paired with attribution patching. Across seven SAE widths from 16,384 to 1,048,576 features on Gemma-2-2B layer 12, representational capacity grows 15.6x while causal capacity grows only 4.35x: a robust separation we term the representational-causal wedge. A saturating fit yields kappa-hat approximately 1,990 with kappa-hat / d_model = 0.86 and participation-ratio lower bound kappa_PR approximately 280. Crucially, kappa is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions). Across eight network depths kappa is constant while the absolute attribution threshold drops 20x from layer 1 to layer 23. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, and a four-cell encoder/decoder ablation) pin down what kappa measures and what it does not. Our findings establish kappa as a measurable, model-intrinsic property of transformer layers: sub-linearly recoverable by SAE width, invariant to model scaling, and structured across network depth.
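The abstract reports a saturating fit over the seven-width sweep without stating its functional form (a point raised in the referee report below). The sketch that follows shows one plausible way such a fit could be run; the Michaelis-Menten-style curve and the N_causal values are illustrative assumptions, not the paper's data.

```python
# Sketch of a saturating fit of causal feature count vs. SAE width.
# Functional form and data points are assumptions for illustration only.
import numpy as np
from scipy.optimize import curve_fit

widths = np.array([16_384, 32_768, 65_536, 131_072, 262_144, 524_288, 1_048_576], dtype=float)
n_causal = np.array([450, 700, 1000, 1300, 1550, 1750, 1900], dtype=float)  # made-up values

def saturating(w, kappa_inf, w_half):
    """Causal count that rises with SAE width w but saturates at kappa_inf."""
    return kappa_inf * w / (w + w_half)

(kappa_hat, w_half), _ = curve_fit(saturating, widths, n_causal, p0=[2000.0, 50_000.0])

# Simple goodness-of-fit diagnostic (R^2), as the referee asks for.
residuals = n_causal - saturating(widths, kappa_hat, w_half)
ss_res = np.sum(residuals ** 2)
ss_tot = np.sum((n_causal - n_causal.mean()) ** 2)
r_squared = 1.0 - ss_res / ss_tot

print(f"kappa_hat ~ {kappa_hat:.0f}, half-saturation width ~ {w_half:.0f}, R^2 = {r_squared:.3f}")
```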
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces causal dimensionality kappa(L, M, T) as the effective rank of the expected Jacobian outer product at transformer layer L, estimated via SAE width sweeps paired with attribution patching. It reports a representational-causal wedge in which SAE width increases representational capacity 15.6x but causal capacity only 4.35x on Gemma-2-2B layer 12; kappa is invariant to model scaling (identical N_causal=328 for Gemma-2-9B and 2B at matched widths despite 3.46x parameter difference); and kappa remains constant across eight depths while absolute attribution thresholds drop 20x. Five controls (architecture invariance, threshold robustness, geometric privilege, synthetic ground-truth recovery, encoder/decoder ablation) are used to characterize what kappa measures.
Significance. If the width-sweep plus attribution-patching procedure recovers an unbiased estimate of effective rank, the work would establish a new intrinsic, model-scale-invariant property of transformer layers that separates representational from causal capacity and is structured by depth. The explicit listing of five controls, including synthetic ground-truth recovery, and the direct reporting of the wedge constitute strengths that would make the result useful for interpretability and scaling research.
major comments (3)
- [Abstract] Abstract: the reported invariance to model scaling rests on identical N_causal=328 at the same SAE width for Gemma-2-9B and Gemma-2-2B, yet N_causal is explicitly forced to 2% of SAE width by calibration; while the text correctly identifies shape invariance of the AtP score distribution as the substantive claim, no quantitative comparison (e.g., distribution overlap or statistical test) of the AtP scores across the two models is provided to demonstrate that the shapes are in fact invariant rather than an artifact of the shared calibration rule.
- [Abstract] Abstract: the saturating fit that yields kappa-hat approximately 1,990 (with kappa-hat / d_model = 0.86) is applied post-hoc to the seven-width sweep data; without the explicit functional form, fitted parameters, or goodness-of-fit diagnostics for this saturating model, it is impossible to assess whether the reported value reflects a genuine property of the Jacobian or is shaped by the choice of saturating function and width range (16,384 to 1,048,576).
- [Abstract] Abstract: the central methodological assumption—that the SAE width sweep combined with attribution patching recovers the effective rank of the expected Jacobian outer product without substantial bias from SAE training dynamics, reconstruction error, or patching interference—is load-bearing for all scaling and depth claims. Although five controls are listed, including synthetic ground-truth recovery, it is not shown that these controls were executed under the precise experimental conditions (seq=512, multi-layer residual streams) used for the main Gemma experiments, leaving open the possibility of differential bias across model scales and depths.
minor comments (2)
- The notation kappa(L, M, T) is introduced without an explicit equation or definition of its dependence on sequence length T; a short formal definition would improve clarity.
- The participation-ratio lower bound kappa_PR approximately 280 is stated but its exact computation and relation to the main kappa-hat estimate are not expanded; a brief derivation or reference would help readers connect the two quantities.
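For reference, the two quantities the minor comments ask about can be written in their standard forms. This is one plausible reading; the paper's exact conventions, including how T enters, may differ.

```latex
% Standard definitions, offered as one plausible reading of kappa(L, M, T).
% Let G_L = E_{x \sim T}\!\left[ J_L(x)^{\top} J_L(x) \right] be the expected
% Jacobian outer product at layer L of model M, with eigenvalues
% \lambda_1 \ge \lambda_2 \ge \dots \ge 0. Then
\kappa_{\mathrm{PR}}(L, M, T) = \frac{\bigl(\sum_i \lambda_i\bigr)^{2}}{\sum_i \lambda_i^{2}},
\qquad
\kappa_{\mathrm{erank}}(L, M, T) = \exp\!\Bigl(-\sum_i p_i \log p_i\Bigr),
\quad p_i = \frac{\lambda_i}{\sum_j \lambda_j}.
% The participation ratio never exceeds the entropy-based effective rank
% (Renyi entropies are non-increasing in order), consistent with its use
% as a lower bound on the main kappa estimate.
```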
Simulated Author's Rebuttal
Thank you for the thoughtful review. We have carefully considered each major comment and provide point-by-point responses below. Where the comments highlight areas for improvement in presentation or additional evidence, we commit to revisions in the next version of the manuscript.
Point-by-point responses
Referee: [Abstract] Abstract: the reported invariance to model scaling rests on identical N_causal=328 at the same SAE width for Gemma-2-9B and Gemma-2-2B, yet N_causal is explicitly forced to 2% of SAE width by calibration; while the text correctly identifies shape invariance of the AtP score distribution as the substantive claim, no quantitative comparison (e.g., distribution overlap or statistical test) of the AtP scores across the two models is provided to demonstrate that the shapes are in fact invariant rather than an artifact of the shared calibration rule.
Authors: We agree that a quantitative comparison would strengthen the claim of shape invariance. In the revised manuscript, we will add a supplementary figure comparing the AtP score distributions for Gemma-2-9B and Gemma-2-2B at matched SAE widths under seq=512 conditions, including metrics such as distribution overlap (e.g., Jensen-Shannon divergence) or a statistical test to confirm similarity independent of the calibration rule. revision: yes
Referee: [Abstract] Abstract: the saturating fit that yields kappa-hat approximately 1,990 (with kappa-hat / d_model = 0.86) is applied post-hoc to the seven-width sweep data; without the explicit functional form, fitted parameters, or goodness-of-fit diagnostics for this saturating model, it is impossible to assess whether the reported value reflects a genuine property of the Jacobian or is shaped by the choice of saturating function and width range (16,384 to 1,048,576).
Authors: We acknowledge the need for greater transparency in the fitting procedure. The revised manuscript will include the explicit functional form of the saturating model, the fitted parameters, and goodness-of-fit diagnostics such as R² and residual analysis to allow readers to assess whether the kappa estimate reflects properties of the Jacobian rather than the choice of function or width range. revision: yes
Referee: [Abstract] Abstract: the central methodological assumption—that the SAE width sweep combined with attribution patching recovers the effective rank of the expected Jacobian outer product without substantial bias from SAE training dynamics, reconstruction error, or patching interference—is load-bearing for all scaling and depth claims. Although five controls are listed, including synthetic ground-truth recovery, it is not shown that these controls were executed under the precise experimental conditions (seq=512, multi-layer residual streams) used for the main Gemma experiments, leaving open the possibility of differential bias across model scales and depths.
Authors: This is a valid point regarding the scope of the controls. We will revise the manuscript to explicitly state the experimental conditions for each of the five controls. The synthetic ground-truth recovery was performed in a controlled synthetic setting to validate rank recovery, while the remaining controls (architecture invariance, threshold robustness, geometric privilege, and encoder/decoder ablation) were applied to the Gemma residual streams. We will add a clarifying table or section and discuss any limitations in generalizing across scales and depths. revision: partial
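The first exchange above commits to a quantitative comparison of AtP score distributions across the two models. A minimal sketch of that comparison follows; the score arrays and parameters are placeholders, not the paper's data, and real scores would come from attribution patching under the matched seq=512 conditions.

```python
# Sketch: compare the shapes of two AtP score distributions via
# Jensen-Shannon distance and a two-sample KS test. Scores are placeholders.
import numpy as np
from scipy.spatial.distance import jensenshannon
from scipy.stats import ks_2samp

rng = np.random.default_rng(0)
scores_2b = rng.lognormal(mean=-3.0, sigma=1.0, size=16_384)   # stand-in for Gemma-2-2B scores
scores_9b = rng.lognormal(mean=-3.1, sigma=1.0, size=16_384)   # stand-in for Gemma-2-9B scores

# Histogram both score sets on a shared grid, then compare the normalised shapes.
bins = np.histogram_bin_edges(np.concatenate([scores_2b, scores_9b]), bins=100)
p, _ = np.histogram(scores_2b, bins=bins, density=True)
q, _ = np.histogram(scores_9b, bins=bins, density=True)

jsd = jensenshannon(p, q, base=2)            # 0 = identical shapes, 1 = disjoint
ks_stat, ks_pval = ks_2samp(scores_2b, scores_9b)

print(f"Jensen-Shannon distance = {jsd:.3f}, KS statistic = {ks_stat:.3f} (p = {ks_pval:.3g})")
```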
Circularity Check
N_causal invariance to scaling reduces to 2% calibration rule at matched SAE widths
specific steps
- self definitional · [Abstract]
"Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 at the same SAE width despite a 3.46x parameter increase (the count is forced to 2% of SAE width by calibration; the substantive empirical claim is shape invariance of the AtP score distribution under matched seq=512 conditions)"
N_causal is fixed by calibration at 2% of the SAE width. Matching SAE widths across models therefore produces identical N_causal by construction, independent of any actual invariance in the underlying expected Jacobian outer product rank. The invariance claim for kappa thus reduces to the shared calibration rule rather than emerging from the data.
full rationale
The paper's central claims rest on estimating kappa (effective rank of expected Jacobian outer product) via SAE width sweeps + attribution patching, with N_causal explicitly calibrated to 2% of SAE width. This calibration directly forces identical N_causal values when the same widths are used across models, rendering the reported invariance to 3.46x scaling a consequence of the procedure rather than an independent derivation from Jacobian properties. The paper acknowledges the forcing but pivots to AtP score shape invariance; however, the headline results (kappa invariance, representational-causal wedge, depth constancy) inherit this construction without external benchmarks confirming unbiased rank recovery. No load-bearing self-citations or ansatz smuggling detected, but the measurement chain exhibits partial circularity per the fitted/calibrated input pattern.
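A toy illustration of the construction being flagged, under the assumption that N_causal is simply the top 2% of features at a given SAE width; none of this is the paper's code.

```python
# Toy illustration of the calibration circularity: a quantile cutoff at 2% of
# SAE width yields the same count for any model, however different the scores.
import numpy as np

rng = np.random.default_rng(0)
sae_width = 16_384
calibration = 0.02  # the fixed 2% cutoff listed in the free-parameter ledger

def n_causal(atp_scores: np.ndarray, frac: float = calibration) -> int:
    """Count features kept by a cutoff at `frac` of the SAE width."""
    k = int(round(frac * atp_scores.size))
    return k  # the count is fixed by construction; only *which* features pass varies

scores_model_a = rng.lognormal(-3.0, 1.0, size=sae_width)   # hypothetical AtP scores
scores_model_b = rng.lognormal(-5.0, 0.3, size=sae_width)   # very different distribution

print(n_causal(scores_model_a), n_causal(scores_model_b))   # both print 328
```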
Axiom & Free-Parameter Ledger
free parameters (2)
- saturating-fit parameter (kappa-hat) = approximately 1,990
- calibration factor for N_causal = 2% of SAE width
axioms (2)
- domain assumption: Sparse autoencoders decompose transformer residual streams into interpretable feature dictionaries whose width controls representational capacity.
- domain assumption: Attribution patching measures causal influence on model output.
invented entities (1)
- causal dimensionality kappa (no independent evidence)
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction (tagged: unclear)
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Quoted passage: "κ is invariant to model scaling: Gemma-2-9B and Gemma-2-2B yield identical N_causal = 328 ... Across eight network depths κ is constant"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [1] Ansuini, A., Laio, A., Macke, J. H., and Zoccolan, D. (2019). Intrinsic dimension of data representations in deep neural networks. In NeurIPS.
- [2] Biderman, S., Schoelkopf, H., Anthony, Q., Bradley, H., O'Brien, K., Hallahan, E., Khan, M. A., Purohit, S., Prashanth, U. S., Raff, E., et al. (2023). Pythia: A Suite for Analyzing Large Language Models Across Training and Scaling. arXiv:2304.01373.
- [3] Bricken, T., Templeton, A., Batson, J., Chen, B., Jermyn, A., et al. (2023). Towards Monosemanticity: Decomposing Language Models With Dictionary Learning. Transformer Circuits Thread.
- [4] Chanin, D., Lloyd, C., Heimersheim, S., and Hooker, S. (2025). A is for Absorption: Studying Feature Splitting and Absorption in Sparse Autoencoders. In ICLR.
- [5] Cheng, M., Diaz, M., et al. (2025). Emergence of a High-Dimensional Abstraction Phase in Language Transformers. In ICLR.
- [6] Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). Towards Automated Circuit Discovery for Mechanistic Interpretability. In NeurIPS.
- [7] Dubey, A., Jauhri, A., et al. (2024). The Llama 3 Herd of Models. arXiv:2407.21783.
- [8] Elhage, N., Hume, T., Olsson, C., et al. (2022). Toy Models of Superposition. Transformer Circuits Thread.
- [9] Feng, R., Zheng, K., Huang, Y., Zhao, D., Jordan, M., and Zha, Z.-J. (2022). Rank Diminishing in Deep Neural Networks. In NeurIPS.
- [10] Gao, L., la Tour, T. D., Tillman, H., Goh, G., Troll, R., Radford, A., Sutskever, I., Leike, J., and Wu, J. (2024). Scaling and evaluating sparse autoencoders. arXiv:2406.04093.
- [11] Gemma Team. (2024). Gemma 2: Improving Open Language Models at a Practical Size. arXiv:2408.00118.
- [12] Karvonen, A., et al. (2025). SAEBench: A Comprehensive Benchmark for Sparse Autoencoders. In ICML.
- [13]
- [14] Makhzani, A. and Frey, B. (2013). k-Sparse Autoencoders. arXiv:1312.5663.
- [15] Marks, S., Rager, C., Michaud, E. J., Belinkov, Y., Bau, D., and Mueller, A. (2025). Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs. In ICLR.
- [16] Michaud, E. J., Liu, Z., Girit, U., and Tegmark, M. (2023). The Quantization Model of Neural Scaling. In NeurIPS.
- [17] Nanda, N. (2022a). Attribution Patching: Activation Patching at Industrial Scale. https://www.neelnanda.io/mechanistic-interpretability/attribution-patching
- [18] Nanda, N. (2022b). The Pile-10k Subset. https://huggingface.co/datasets/NeelNanda/pile-10k
- [19] Park, K., Choe, Y. J., and Veitch, V. (2024). The Linear Representation Hypothesis and the Geometry of Large Language Models. In ICML.
- [20]
- [21] Roy, O. and Vetterli, M. (2007). The effective rank: A measure of effective dimensionality. In EUSIPCO.
- [22] Templeton, A., Conerly, T., Marcus, J., Lindsey, J., Bricken, T., et al. (2024). Scaling Monosemanticity: Extracting Interpretable Features from Claude 3 Sonnet. Transformer Circuits Thread.
- [23] Wang, K., Variengien, A., Conmy, A., Shlegeris, B., and Steinhardt, J. (2023). Interpretability in the Wild: A Circuit for Indirect Object Identification in GPT-2 Small. In ICLR.