pith. sign in

arxiv: 2605.27819 · v1 · pith:2NV4JWTXnew · submitted 2026-05-27 · 💻 cs.LG · cs.AI

ReSAE: Residualized Sparse Autoencoders for Multi-Layer Transformer Interventions

Pith reviewed 2026-06-29 14:50 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords sparse autoencodersresidual streamtransformer interpretabilitymulti-layer interventionsaffine residualizationPythia modelGemma model
0
0 comments X

The pith

ReSAEs train sparse autoencoders on layer residuals after removing linearly predictable cross-layer information to enable better multi-layer interventions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard sparse autoencoders trained independently per layer often redundantly encode the same carried-forward activation information across transformer depth. ReSAEs address this by fitting an affine map between layers and training each SAE on the residual unexplained by that map. Reconstructions are then mapped back to the original space for use in interventions. On Pythia-1.4B and Gemma-2-9B, this yields lower decoder redundancy and higher performance on sparse probing and targeted perturbation, plus better cross-entropy recovery during multi-layer replacements despite reconstructing less raw variance. The approach shows that factoring out linear predictability leaves behind components more relevant to the model's computation.

Core claim

The central discovery is that training SAEs on residuals after affine residualization between layers produces dictionaries that, when used for multi-layer interventions, recover more of the transformer's original cross-entropy loss than unresidualized SAEs, even though they explain less activation variance. This holds on Pythia-1.4B and Gemma-2-9B models, with gains clearest under teacher-forcing and adequate sparsity.

What carries the argument

The ReSAE mechanism, consisting of an affine map fitted between selected layers whose output is subtracted to produce the residual input to the SAE, with the SAE output added back through the inverse affine chain.

Load-bearing premise

The linearly predictable carried-forward information between layers can be removed without losing features essential to the model's downstream behavior.

What would settle it

A controlled test where multi-layer ReSAE interventions recover less cross-entropy than standard SAE interventions on a specific task would falsify the claim of superior preservation of computation-relevant components.

Figures

Figures reproduced from arXiv: 2605.27819 by Calvin Yeung, Mohsen Imani, Neel Desai, Prathyush Poduval.

Figure 1
Figure 1. Figure 1: Residualization overview and cross-layer regression [PITH_FULL_IMAGE:figures/full_fig_p002_1.png] view at source ↗
Figure 2
Figure 2. Figure 2: Pythia-1.4B reconstruction and decoder geometry. [PITH_FULL_IMAGE:figures/full_fig_p006_2.png] view at source ↗
Figure 3
Figure 3. Figure 3: Pythia-1.4B recovered CE and overinteraction for all four selected layers. [PITH_FULL_IMAGE:figures/full_fig_p007_3.png] view at source ↗
Figure 4
Figure 4. Figure 4: Pythia-1.4B feature-level results. Left: sparse probing accuracy using the top five SAE features across selected layers. Right: TPP under the per-layer joint budget. Higher is better in both panels. Global Budget Per-Layer Joint k Raw ReSAE Raw ReSAE 40 +0.035 +0.032 +0.028 −0.012 80 +0.045 +0.002 +0.007 −0.040 160 +0.035 −0.022 −0.013 −0.124 320 +0.011 −0.007 −0.043 −0.077 [PITH_FULL_IMAGE:figures/full_f… view at source ↗
Figure 5
Figure 5. Figure 5: Gemma-2-9B layer-gap trends (k = 160). Left: mean max decoder cosine similarity by layer gap; lower indicates less redundant decoder directions. Right: ∆CE by layer gap. Solid lines use the teacher-forced left axis, while dashed lines use the sequential-online right axis [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗
read the original abstract

Sparse autoencoders are usually trained one layer at a time, even though transformer residual stream activations are strongly coupled across depth. This creates a practical problem for multi-layer interventions: different layerwise dictionaries can spend capacity representing the same carried-forward information, and replacing several layers at once can produce interactions that are not predicted by single-layer behavior. We introduce Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected layers and train each later-layer SAE on the unexplained residual rather than on the full activation. Reconstructions are mapped back into the original activation space through the fitted affine chain, so ReSAEs can be evaluated with the same intervention protocols as ordinary SAEs. On Pythia-1.4B and Gemma-2-9B, residualization reduces decoder redundancy and improves sparse probing and targeted perturbation in most tested settings. Despite reconstructing less of the raw activation variance, ReSAEs recover more transformer cross entropy under multi-layer replacement. This gain is clearest under teacher-forcing and at sufficient sparsity online, indicating that ReSAEs preserve the components of the activation most relevant to the model's downstream computation. These results suggest that removing linearly predictable cross-layer structure is a useful default for multi-layer SAE interventions.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces Residualized Sparse Autoencoders (ReSAEs), which fit an affine map between selected transformer layers and train each subsequent SAE on the residual activation after subtracting the carried-forward linear component. Reconstructions are mapped back via the affine chain for compatibility with standard intervention protocols. On Pythia-1.4B and Gemma-2-9B, the method is reported to reduce decoder redundancy, improve sparse probing and targeted perturbation in most settings, and recover more model cross-entropy under multi-layer replacement despite reconstructing less raw activation variance; the gain is stated to be clearest at sufficient sparsity and under teacher-forcing.

Significance. If the central empirical claim holds after addressing the control issue, residualization would provide a useful default preprocessing step for multi-layer SAE interventions, directly tackling the problem of redundant capacity spent on cross-layer carry-forward. The evaluation on two distinct model families supplies a reasonable starting point for the observation.

major comments (2)
  1. [Abstract / §4] Abstract and §4 (results on multi-layer replacement): the claim that ReSAEs 'preserve the components of the activation most relevant to the model's downstream computation' because they recover more CE despite lower variance reconstruction is load-bearing for the headline result. This interpretation requires evidence that the removed affine component does not itself carry functionally relevant features. No control experiment is described that intervenes by replacing activations with the affine prediction alone (or measures CE degradation attributable to the linear part), leaving the functional importance of the residual under-determined.
  2. [§3] §3 (method) and evaluation protocol: the affine map coefficients are fit on the same data used for SAE training and downstream evaluation. While this is not a derivation that reduces to the fit by construction, the absence of a held-out layer or model for the affine fit introduces a moderate risk that reported gains partly reflect reduced training variance or back-mapping artifacts rather than isolation of computation-relevant structure. A cross-model or cross-dataset validation of the affine map would strengthen the result.
minor comments (2)
  1. [Abstract] The abstract states gains occur 'in most tested settings' and 'at sufficient sparsity'; the main text should include a table or figure that explicitly lists all settings tested and the fraction in which improvement was observed, to avoid selective reporting.
  2. [§3] Notation for the affine map (e.g., how the chain of maps is composed for multi-layer replacement) should be defined once in §3 with a clear equation, then used consistently in the intervention description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major point below, providing clarifications and committing to revisions where the concerns identify gaps in the current evidence.

read point-by-point responses
  1. Referee: [Abstract / §4] Abstract and §4 (results on multi-layer replacement): the claim that ReSAEs 'preserve the components of the activation most relevant to the model's downstream computation' because they recover more CE despite lower variance reconstruction is load-bearing for the headline result. This interpretation requires evidence that the removed affine component does not itself carry functionally relevant features. No control experiment is described that intervenes by replacing activations with the affine prediction alone (or measures CE degradation attributable to the linear part), leaving the functional importance of the residual under-determined.

    Authors: We agree that the headline interpretation would be strengthened by a direct control that replaces activations using only the fitted affine prediction and measures the resulting CE change. Our existing results show that ReSAEs (which discard the affine component) recover more CE than standard SAEs (which retain it) under multi-layer replacement, providing indirect evidence that the residual carries more intervention-relevant structure. However, this does not fully rule out functional content in the affine part. We will add the suggested control experiment (affine-only replacement) to §4 in the revised manuscript. revision: yes

  2. Referee: [§3] §3 (method) and evaluation protocol: the affine map coefficients are fit on the same data used for SAE training and downstream evaluation. While this is not a derivation that reduces to the fit by construction, the absence of a held-out layer or model for the affine fit introduces a moderate risk that reported gains partly reflect reduced training variance or back-mapping artifacts rather than isolation of computation-relevant structure. A cross-model or cross-dataset validation of the affine map would strengthen the result.

    Authors: The affine map is a low-parameter linear fit whose purpose is to remove cross-layer linear dependence before SAE training; the subsequent SAE optimization and all reported metrics (probing, perturbation, CE) are performed on the residual. While we acknowledge that fitting and evaluating on the same data introduces a risk of optimistic bias, the consistency of gains across two architecturally distinct models (Pythia-1.4B and Gemma-2-9B) and multiple tasks reduces the likelihood that results are artifacts of a single data split. We will add a brief discussion of this limitation to §3 and include a cross-dataset check (holding out a separate corpus for the affine fit) in the revision. revision: partial

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent evaluation.

full rationale

The paper defines ReSAE as fitting an affine map between layers and training SAEs on the residual activation, then maps reconstructions back via the chain. Central claims are empirical: reduced decoder redundancy and higher cross-entropy recovery under multi-layer interventions on Pythia-1.4B and Gemma-2-9B, despite lower raw variance reconstruction. These are measured outcomes on held-out model behavior, not quantities forced by the affine fit itself. No self-citations, uniqueness theorems, or ansatzes are invoked to justify the method. No step reduces a claimed result to the fitted parameters by construction; the evaluation uses standard intervention protocols on the same models but reports observable differences in probing and perturbation metrics.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The central claim rests on the domain assumption of strong cross-layer coupling in residual streams and on fitted affine parameters; no new entities are postulated.

free parameters (1)
  • affine map coefficients
    Fitted between selected layers to capture linearly predictable carried-forward activations.
axioms (1)
  • domain assumption Transformer residual stream activations are strongly coupled across depth
    Invoked in the first sentence to motivate residualization.

pith-pipeline@v0.9.1-grok · 5761 in / 1498 out tokens · 61230 ms · 2026-06-29T14:50:29.666137+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

8 extracted references · 7 canonical work pages · 5 internal anchors

  1. [1]

    Sparse Autoencoders Find Highly Interpretable Features in Language Models

    NeurIPS 2025 oral. Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models.arXiv preprint arXiv:2309.08600,

  2. [2]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team, Morgane Riviere, Shreya Pathak, Pier Giuseppe Sessa, Cassidy Hardin, et al. Gemma 2: Improving open language models at a practical size.arXiv preprint arXiv:2408.00118,

  3. [3]

    Llama Scope: Extracting millions of features from Llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526,

    Zhengfu He, Wentao Shu, Xuyang Ge, Lingjie Chen, Junxuan Wang, Yunhua Zhou, Frances Liu, Qipeng Guo, Xuanjing Huang, Zuxuan Wu, Yu-Gang Jiang, and Xipeng Qiu. Llama Scope: Extracting millions of features from Llama-3.1-8b with sparse autoencoders.arXiv preprint arXiv:2410.20526,

  4. [4]

    Evaluating sparse autoencoders on targeted concept erasure tasks

    Adam Karvonen, Can Rager, Samuel Marks, and Neel Nanda. Evaluating sparse autoencoders on targeted concept erasure tasks. arXiv preprint arXiv:2411.18895, 2024a. Adam Karvonen, Benjamin Wright, Can Rager, Rico Angell, Jannik Brinkmann, Logan Smith, Claudio Mayrink Verdun, David Bau, and Samuel Marks. Measuring progress in dictionary learning for language ...

  5. [5]

    Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma

    11 APREPRINT- MAY28, 2026 Tom Lieberum, Senthooran Rajamanoharan, Arthur Conmy, Lewis Smith, Nicolas Sonnerat, Vikrant Varma, János Kramár, Anca Dragan, Rohin Shah, and Neel Nanda. Gemma Scope: Open sparse autoencoders everywhere all at once on Gemma

  6. [6]

    Jumping Ahead: Improving Reconstruction Fidelity with JumpReLU Sparse Autoencoders

    Senthooran Rajamanoharan, Tom Lieberum, Nicolas Sonnerat, Arthur Conmy, Vikrant Varma, János Kramár, and Neel Nanda. Jumping ahead: Improving reconstruction fidelity with JumpReLU sparse autoencoders.arXiv preprint arXiv:2407.14435,

  7. [7]

    Towards Understanding Sycophancy in Language Models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, Amanda Askell, Samuel R. Bowman, Newton Cheng, Esin Durmus, Zac Hatfield-Dodds, Scott R. Johnston, Shauna Kravec, Timothy Maxwell, Sam McCandlish, Kamal Ndousse, Oliver Rausch, Nicholas Schiefer, Da Yan, Miranda Zhang, and Ethan Perez. Towards understanding sycophancy in language models.arXiv prepri...

  8. [8]

    Universal and Transferable Adversarial Attacks on Aligned Language Models

    Andy Zou, Zifan Wang, Nicholas Carlini, Milad Nasr, J. Zico Kolter, and Matt Fredrikson. Universal and transferable adversarial attacks on aligned language models.arXiv preprint arXiv:2307.15043,