Size Doesn't Matter: Cosine-Scored Sparse Autoencoders

Lev Stambler; Silen Naihin

arxiv: 2606.15054 · v2 · pith:LOWOSRWInew · submitted 2026-06-13 · 💻 cs.LG

Pith reviewed 2026-07-01 07:12 UTC · model grok-4.3

classification 💻 cs.LG

keywords: sparse autoencoders · feature interpretability · cosine similarity · dictionary learning · activation norms · neural network features · representation learning

The pith

Cosine-scored sparse autoencoders learn more human-recognizable features than inner-product ones at matched reconstruction quality.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard sparse autoencoders score features by inner product, so activation strength depends on both directional match and token magnitude. The paper argues that sublayer normalization has already removed magnitude from what the model reads, so many dictionary slots end up as norm detectors rather than content features. It replaces the score with a learned blend of cosine similarity and magnitude and lets the optimizer choose the mix. No feature learns more than half-magnitude dependence, and the resulting dictionary contains fewer pure norm features and more concepts that humans recognize. The gap persists after loss reweighting, which the paper takes as evidence that the forward-pass scoring geometry itself is the cause.

Core claim

Replacing the inner-product score in sparse autoencoders with a learned blend of cosine similarity and input magnitude causes the optimizer to rely almost entirely on cosine similarity. No feature ever assigns more than half its weight to magnitude. At equal reconstruction loss, the cosine version produces substantially more features that align with human-recognizable concepts and fewer that activate only on token norm.

What carries the argument

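The cosine-scored encoder: a learned blend of cosine similarity and input magnitude that replaces the inner product as the feature-activation score. Per the Figure 1 caption, the blend is a learned exponent a on the input norm multiplying the cosine, not a linear combination; a = 0 gives pure cosine and a = 1 gives the inner product.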
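As a rough illustration, the score given in the Figure 1 caption can be sketched in plain Python. The function name `blended_score`, its argument names, and the toy vectors are hypothetical. The paper's centering of the input (the subscript c in x_c), its sparsity step, and its actual parameterization of a and b are not reproduced here.

```python
import math

def blended_score(x, w, a, b=0.0, bias=0.0):
    """Pre-activation exp(b) * ||x||^a * cos(x, w) + bias for one feature direction w.

    Assumed reading of the Figure 1 caption; w is implicitly unit-normalized
    because only its direction enters through the cosine.
    """
    x_norm = math.sqrt(sum(v * v for v in x))
    w_norm = math.sqrt(sum(v * v for v in w))
    if x_norm == 0.0 or w_norm == 0.0:
        return bias  # cosine is undefined for a zero vector; treat alignment as 0
    cos = sum(xv * wv for xv, wv in zip(x, w)) / (x_norm * w_norm)
    return math.exp(b) * (x_norm ** a) * cos + bias

w = [1.0, 0.0]                # toy feature direction
aligned_small = [1.0, 0.0]    # cos = 1.0 with w, norm = 1
misaligned_big = [6.0, 8.0]   # cos = 0.6 with w, norm = 10

for a in (0.0, 0.26, 1.0):
    print(a, blended_score(aligned_small, w, a), blended_score(misaligned_big, w, a))
```

With a = 0 the well-aligned low-norm input scores higher (about 1.0 against 0.6). With a = 1 the score equals the inner product with the unit-normalized direction, and the high-norm input wins (about 6.0 against 1.0). Intermediate exponents trade the two off.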

If this is right

Dictionary capacity is no longer spent on norm-only detectors.
More slots become available for directional, content-based features.
Loss reweighting alone cannot replicate the gain; the scoring function geometry matters.
The improvement is observed across some but not all depths and tasks.
Cosine scoring is presented as the default choice for normalized representations.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Any dictionary-learning method on normalized activations could benefit from directional scoring to reduce wasted capacity.
The same scoring change might improve other sparse coding techniques that currently rely on inner products.
A direct test would be to run both scorers on the same non-normalized activations and compare feature quality.
If the pattern holds, training pipelines for interpretability work could default to cosine scoring without added cost.

Load-bearing premise

Sublayer normalization has already removed all magnitude information that the model actually uses, rendering norm detection pointless.

What would settle it

Measure interpretability on a non-normalized layer or model where magnitude still carries signal; if cosine scoring loses its advantage there, the claim is falsified.

Figures

Figures reproduced from arXiv: 2606.15054 by Lev Stambler, Silen Naihin.

**Figure 1.** The cosine encoder: architecture and headline results. Left: Standard SAE encoder computes $\langle w_i, x_c \rangle$, coupling alignment with norm. Cosine encoder unit-normalizes $w_i$ and replaces the score with $e^{b}\,\|x_c\|^{a} \cos(x_c, w_i) + b_{\mathrm{enc},i}$; $a$ interpolates between pure cosine ($a=0$) and inner product ($a=1$). Right: Matched reconstruction (FVE ≈ 0.77) with +14.6% sparse-probing top-1 (mean over three SAE-training seeds). Tra…

**Figure 2.** Cosine-scored SAEs win on probing because standard features fire on token norm. (A) Result: sparse-probing top-1 across eight tasks (Qwen3-8B L18, 500M tokens, $d_{\mathrm{sae}}$ = 65,536, matched FVE ≈ 0.77). Per-feature cosine wins on 7/8 tasks; sentiment is the only exception. (B) Cause: standard’s unmatched features (those with no nearest-neighbor counterpart in cosine’s dictionary) fire 22× more on the highest-norm t…

**Figure 3.** Score-surface geometry. Each panel plots the encoder pre-activation $\|x_c\|^{a} \cos(x_c, w_i)$ over alignment (x-axis) and input norm (y-axis). Black curves join equally-scored pairs. Left: $a=1$ (inner product); hyperbolic curves, high-norm tokens outscore better-aligned low-norm ones. Center: $a=0$ (cosine); vertical curves, norm ignored. Right: global learned $a \approx 0.26$; mild tilt, close to cosine. Bottom: per-feature …

**Figure 4.** Aggregate sparse-probing accuracy. Top-1, top-2, and top-5 probe accuracy across all eight SAEBench datasets. FVE matched at ≈ 0.77. Black: standard SAE; violet: per-feature cosine encoder. The gap narrows at higher k but remains large (+9.4% at top-5). Per-dataset breakdown…

**Figure 5.** Discovery dominates separability. Sparse-probing accuracy when each SAE uses only features shared with the other dictionary (“shared features”) versus its full dictionary (“all features”). Standard’s flat slope shows its unique features add no probe signal; cosine’s steep rise shows its unique features encode interpretable concepts. The gap on the right is the total probing advantage, driven almost entire…

**Figure 7.** Cosine features ablate more cleanly. Ablating the top-N probe-selected features and decomposing the logit change into intended (target concept) and unintended (collateral) effects. Left: precision (intended/collateral) rises with N for cosine but collapses toward 1× for standard. Right: the cause is collateral, not intended effect; standard’s unintended damage grows two orders of magnitude while cosine’s s…

**Figure 9.** Model size, not expansion ratio, drives the cosine advantage. Sparse-probing top-1 gap (cosine − standard, in percentage points; mean ± SD over three SAE-training seeds) across the Qwen3 family. Row means (right) show a ∼2× jump from 1.7B to ≥4B; column means (bottom) are flat. All cells use the community SAEBench recipe, 50M tokens, k = 80. [exp67] preserves magnitude-correlated structure that RMSNorm e…

**Figure 10.** Winner-take-all cascade in unconstrained per-feature $a_i$. Qwen3-8B L27, 50M tokens. (A) Unconstrained per-feature (red) loses 67% of features in a single 500-step window at step 5,500; base+delta (blue) retains all features throughout. (B) Base+delta also achieves higher final FVE (0.77 vs. 0.72) because surviving features encode useful content rather than fighting for norm-dominated TopK slots. [exp47] Re…

**Figure 11.** Magnitude-bypass family: norm restoration is load-bearing; encoder/decoder normalization is decorative. L27, 5M tokens, aux-k on (confirmed no-op). All four restoration-on variants (blue) achieve FVE ≈ 0.55 at 0% dead regardless of enc/dec norm. Restoration-off variants (red) collapse to ≥ 88% dead. For the adaptive family, encoder normalization plays the equivalent stabilizing role (see text above). [exp…

**Figure 12.** Group-size sweep. Gemma-2-2B L13, 50M tokens, $d_{\mathrm{sae}}$ = 9,216. (A) Dead-feature rate decreases monotonically with group size. (B) KL-score and RAVEL peak at G = 4 (star), though differences are small (< 0.002 KL-score between adjacent sizes). The result is suggestive of an intermediate optimum but has not been replicated at the headline scale. [exp37] In practice, we use per-feature base+delta rather than gr…

**Figure 13.** Per-feature $a_i$ distribution at three token budgets (Qwen3-8B L27). Mass shifts toward larger $a_i$ with more data; no $a_i > 0.5$ in any setting. [exp42c]

**Figure 14.** Win-rate diagnostic protocol. (a) Projection ablation removes a decoder direction from the residual stream; the downstream KL-divergence measures its causal importance. (b) Per-sample cosine and inner-product scores are correlated with this causal effect; the scoring rule with higher absolute correlation “wins” for that feature. Depth dependence on RMSNorm. Qwen3-8B: L9 ≈ 44%, L18 ≈ 65%, L27 ≈ 78%. Norm-p…

**Figure 15.** Cosine-vs.-inner-product win rate by depth. Deep RMSNorm: 70–90%. Deep LayerNorm: at chance. [exp25] C.1. SAE-Free Direction vs. Norm Patching The cos>inner diagnostic in §C relies on SAE decoder directions. This subsection removes that dependency entirely: for 500 random cross-prompt token pairs at each layer, we decompose the residual-stream activation into direction ($\hat{x} = x/\|x\|$) and magnitude ($\|x\|$), th…

**Figure 16.** Simpson’s paradox in the cos>inner diagnostic. Within individual norm quartiles (blue), the win rate hovers at 40–70%; the overall rate (red) is 80–87%. The gap grows with depth as residual-stream norm variation increases. This confirms that the between-quartile component (norm variation corrupting cross-token TopK selection) drives most of the overall cos>inner rate. Standard SAE, Qwen3-8B, 50M tokens. […

**Figure 17.** The model reads direction, not magnitude. KL-divergence caused by swapping direction (blue) vs. norm (orange) between random token pairs at three Qwen3-8B layers. Direction patches cause 87–2,560× more disruption; the ratio grows with depth as fewer subsequent layers remain for magnitude to influence direction. n = 500 pairs per layer. [exp8]

**Figure 18.** Per-feature top-1 by dataset (Standard vs. Per-Feature Adaptive Cosine SAE) and aggregate top-k. [exp40] 2. Cos>inner is below 50% at L9. Inner product is a better causal predictor at L9 (33–44%), consistent with magnitude carrying genuine signal at shallow layers where RMSNorm has not yet accumulated. The diagnostic crosses 50% between L9 and L18, and reaches 67–78% at L27. 3. Per-feature (no anchor) col…

**Figure 19.** FVE vs. token budget on Qwen3-8B (500M, our recipe). Adaptive Cosine SAE at the 50M checkpoint reaches Standard’s 500M FVE at L9. [exp42c] The reference SAE matches our own standard baseline within noise at L18 (both ≈ 0.679 at 500M), confirming it is representative. The cross-budget gap is largest at L9, where the cosine encoder’s direction-only scoring is most beneficial relative to the shallow-layer no…

**Figure 20.** Architecture behavior across depth. Qwen3-8B, 50M tokens, all four main variants. (A) FVE converges at L9 and diverges at L27. (B) Per-feature (no anchor) collapses catastrophically at L27; magnitude-bypass shows mild dead features at L18. (C) Cos>inner crosses 50% between L9 and L18, reaching 78% for magnitude-bypass at L27. (D) The optimizer drives $a$ toward zero at shallow layers and toward ∼0.26 at de…

**Figure 21.** Without auxiliary loss, FVE and dead-feature rate by layer (Qwen3-8B, 50M tokens). At 500M tokens the L18 dead-feature gap (Per-Feature Adaptive Cosine SAE vs. Standard) is 1.9%. [exp17] [exp42c]

**Figure 22.** Per-feature interpretability rates are matched across architectures; when the alive-feature gap is open (no auxiliary loss), the cosine encoder nonetheless yields more interpretable features in total because more features are alive. [exp62] F.3. Concrete Feature Examples

**Figure 23.** Encoder-gradient ratio Q4 (high-norm) / Q1 (low-norm). Standard: 35.3% of features have Q4/Q1 > 2. Per-feature cosine: 13.5%. [exp28]

**Figure 24.** Per-quartile FVE (top) and reconstruction-norm ratio $\|\hat{x}\|/\|x\|$ (bottom). [exp55c]

**Figure 25.** Larger dictionary does not fix dead features. Qwen3-8B L27, 50M tokens. (A) Tripling dictionary slots at the same L0 increases the dead rate from 77.4% to 89.4%; the cosine SAE at 1/3 the dictionary achieves 28.2%. (B) Matching cosine’s FVE requires 3× parameters and 3× L0. The dead-feature problem is a scoring-function pathology, not a capacity limitation. [exp26] and thresholds rather than re-sorting fe…

**Figure 26.** Gradient asymmetry exists, but does not explain the probing gap. (A) Median per-feature Q4/Q1 gradient ratio across Qwen layers (10M tokens). (B) Equalizing per-quartile gradients in the standard encoder by reweighting the reconstruction loss ($1/\|x_c\|$; 50M tokens) negligibly shifts top-1. Per-feature ratio definition and full reweighting sweep in text below. Architecture-specific failure modes. Without aux…
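The reweighting control named in the Figure 26 caption can be illustrated with a toy calculation. The sketch below assumes the weight is 1/‖x‖ applied to each token's squared reconstruction error, and it measures the gradient with respect to the reconstruction rather than the encoder weights. The helper names `l2` and `grad_norm` and the two toy tokens are hypothetical, and the paper's per-quartile procedure is not reproduced.

```python
import math

def l2(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))

def grad_norm(x, x_hat, reweight):
    """Norm of d/dx_hat [ w * ||x - x_hat||^2 ], with w = 1/||x|| if reweight else 1."""
    w = 1.0 / l2(x) if reweight else 1.0
    return l2([2.0 * w * (h - c) for c, h in zip(x, x_hat)])

low = [1.0, 0.0]     # low-norm token
high = [10.0, 0.0]   # high-norm token, same direction

# Assume the same 10% relative reconstruction error on both tokens.
low_hat = [0.9 * c for c in low]
high_hat = [0.9 * c for c in high]

print("plain:     ", grad_norm(low, low_hat, False), grad_norm(high, high_hat, False))
print("reweighted:", grad_norm(low, low_hat, True), grad_norm(high, high_hat, True))
```

Under these assumptions the unweighted loss gives the high-norm token a gradient ten times larger, while the 1/‖x‖ weight brings both tokens to the same gradient magnitude. The forward-pass score that decides which features fire is left unchanged by this weighting.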

Original abstract

Sparse autoencoders (SAEs) detect features via inner product, so a feature's activation scales with both its directional alignment and the input's norm. Features that fire on token norm therefore claim dictionary slots regardless of content alignment. This matters because sublayer normalization has already discarded the magnitude the score measures, so the encoder detects a quantity the model does not read. We replace the score with a learned blend of cosine similarity and input magnitude, letting the optimizer choose how much norm to use; a per-feature extension lets each feature decide independently. In both regimes, training is free to recover inner product but never does, with no feature ever choosing more than half-magnitude dependence. At matched reconstruction, the cosine encoder learns features that align with human-recognizable concepts far more often than standard, filling dictionary slots that inner product wastes on norm detectors. Loss reweighting that equalizes gradients barely closes the gap, confirming forward-pass score geometry as the lever. The advantage is not universal across tasks or depths, but we believe cosine scoring should be the default for dictionary learning on normalized representations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

Cosine scoring in SAEs is a minor change that training prefers and that seems to yield more interpretable features in their runs, but the motivation about norm detectors is inconsistent with the normalization the paper itself invokes.

The paper replaces the standard inner-product score in SAEs with a learned blend of cosine similarity and input magnitude, either globally or per feature. Training never recovers the pure inner-product limit and caps magnitude weight at half; at matched reconstruction the cosine version produces more features that look like human-recognizable concepts. They also run a control showing that simply reweighting the loss to balance gradients does not close the gap, which points to the forward-pass geometry as the operative factor.

That is the concrete addition: an optimizer-chosen blend that stays away from the usual score. The per-feature version is a small but natural extension.

The main weakness is the stated reason for the improvement. The abstract argues that inner-product SAEs waste capacity on norm detectors because the score mixes direction with magnitude, yet it also states that sublayer normalization has already removed magnitude variation. With fixed input norm there is no varying magnitude signal left for any feature to detect, so the “norm detector” account does not hold. The empirical pattern may still be real, but it needs a different explanation.

The advantage is also described as non-universal across tasks and depths. For readers already running SAEs on normalized activations this is a low-cost experiment worth trying; the paper is narrow enough that it does not need to be a default. The work shows clear experimental thinking and should go to referees so the numbers and the motivation can be checked directly.

Referee Report

2 major / 1 minor

Summary. The paper proposes cosine-scored (or blended cosine-magnitude) sparse autoencoders as a replacement for standard inner-product SAEs. It argues that inner-product scoring wastes dictionary capacity on norm detectors because activations scale with input magnitude, which sublayer normalization has already removed from the model's view; cosine scoring avoids this and yields more human-recognizable features at matched reconstruction error. Empirical results show the optimizer never recovers full inner-product behavior, loss reweighting does not close the gap, and the advantage is not universal across tasks or depths.

Significance. If the empirical advantage in feature interpretability holds under controlled conditions, the work would provide a simple, low-cost modification to dictionary learning that improves alignment with human concepts without sacrificing reconstruction. The observation that training never selects full magnitude dependence is a useful negative result. However, the claimed mechanism is undermined by the normalization premise itself.

major comments (2)

[Abstract] Abstract and introduction: the core motivation states that inner-product SAEs 'waste dictionary slots on norm detectors' because 'sublayer normalization has already discarded the magnitude the score measures.' After LayerNorm/RMSNorm, ||x|| is fixed at approximately √d for every token, so f·x reduces to a constant times ||f||·cos(θ) with zero variance in the magnitude term. No feature can selectively detect 'token norm' because the quantity has no variance across inputs. This renders the stated reason for preferring cosine scoring internally inconsistent with the normalization premise used to motivate the work.
[Abstract] The central empirical claim (cosine scoring produces more human-recognizable concepts at matched reconstruction) rests on the above motivation. Without a corrected account of why inner-product scoring underperforms, it is unclear whether the observed difference is due to score geometry or to some other uncontrolled factor in the training setup.
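The fixed-norm premise in the first major comment can be checked numerically. The sketch below assumes RMSNorm with unit gain and no bias, and assumes the vector being scored is the normalized output; a learned gain vector rescales coordinates, so the output norm is then no longer exactly constant. The helper names `rms_norm` and `l2_norm` and the random toy inputs are hypothetical.

```python
import math
import random

def rms_norm(x, eps=1e-6):
    """RMSNorm with unit gain: divide each entry by the root mean square of x."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms for v in x]

def l2_norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))

random.seed(0)
d = 512
for scale in (0.1, 1.0, 50.0):
    x = [random.gauss(0.0, scale) for _ in range(d)]
    y = rms_norm(x)
    print(f"input norm {l2_norm(x):10.3f} -> output norm {l2_norm(y):.3f} "
          f"(sqrt(d) = {math.sqrt(d):.3f})")
```

Whatever the input scale, the output norm stays close to √d, which is the sense in which the referee says the magnitude term has no variance after normalization.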

minor comments (1)

[Abstract] Abstract supplies no quantitative metrics, dataset details, statistical tests, or operational definition of 'human-recognizable concepts,' making the strength of the reported advantage impossible to assess from the summary alone.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. We agree that the motivational framing in the abstract and introduction contains an inconsistency regarding norm detection and will revise it. We address the comments point by point below.

Referee: Abstract and introduction: the core motivation states that inner-product SAEs 'waste dictionary slots on norm detectors' because 'sublayer normalization has already discarded the magnitude the score measures.' After LayerNorm/RMSNorm, ||x|| is fixed at approximately √d for every token, so f·x reduces to a constant times ||f||·cos(θ) with zero variance in the magnitude term. No feature can selectively detect 'token norm' because the quantity has no variance across inputs. This renders the stated reason for preferring cosine scoring internally inconsistent with the normalization premise used to motivate the work.

Authors: We acknowledge that the original motivation is internally inconsistent as stated. Since the input norm is constant after normalization, there is no variance for features to detect 'token norm'. The actual issue with inner-product scoring is that the activation is scaled by the learned feature norm ||f||, which can cause the optimizer to favor features with larger norms irrespective of their conceptual alignment. This wastes dictionary capacity on features that are not purely direction-based. We will revise the abstract and introduction to provide a corrected account of why inner-product scoring underperforms, focusing on the role of feature norm in the score rather than input norm detection.

Revision: yes

Referee: The central empirical claim (cosine scoring produces more human-recognizable concepts at matched reconstruction) rests on the above motivation. Without a corrected account of why inner-product scoring underperforms, it is unclear whether the observed difference is due to score geometry or to some other uncontrolled factor in the training setup.

Authors: While the motivation requires correction, the empirical results stand on their own. We demonstrate through multiple experiments that the optimizer does not recover inner-product behavior, that reweighting the loss to balance gradients does not eliminate the advantage of cosine scoring, and that the improvement in feature interpretability occurs at matched reconstruction quality. These controls suggest the difference arises from the score geometry in the forward pass. We will update the manuscript to present the empirical findings with the revised mechanistic explanation.

Revision: partial

Circularity Check

0 steps flagged

No circularity; results are empirical comparisons

full rationale

The paper's claims rest on training multiple SAEs under different scoring regimes and reporting reconstruction quality plus human-interpretable feature counts. No derivation chain, fitted parameter, or self-citation is presented as a 'prediction' that reduces to the input by construction. The norm-detector motivation is an interpretive premise, not a load-bearing mathematical step that tautologically produces the reported advantage.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach rests on one domain assumption about normalization and introduces a single learned scalar (or per-feature scalars) that controls the cosine-magnitude trade-off.

free parameters (1)

blend weight (global or per-feature)
Scalar(s) optimized during training that determine how much magnitude information enters the score.

axioms (1)

domain assumption: Sublayer normalization has already discarded the magnitude the score measures, so the encoder detects a quantity the model does not read.
Explicitly stated in the abstract as the reason inner-product scoring is mismatched.

