Perplexity Can Miss SAE Feature Damage Under Quantization

Evan Duan

arXiv: 2606.03002 · v2 · pith:VW2YKIYFnew · submitted 2026-06-02 · cs.LG · cs.AI

Pith reviewed 2026-06-28 11:39 UTC · model grok-4.3

keywords: quantization, sparse autoencoders, feature fidelity, perplexity, model compression, interpretability, round-to-nearest, Gemma-2

The pith

Perplexity can improve under quantization while many SAE features degrade on the same tokens.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether quantized models preserve the interpretable features that sparse autoencoders extract from full-precision versions. It encodes the same tokens through a frozen SAE and measures per-feature survival via Pearson correlation on activations, across bit widths from INT8 to INT4 on two models (Pythia-70M and Gemma-2-2B). The key observation is that perplexity can improve even as a substantial fraction of active features are degraded or blurred. Survival drops gradually rather than abruptly and can be predicted from the features' full-precision statistics alone, and the features damaged by quantization overlap strongly with those damaged by magnitude pruning. This indicates that behavioral parity is not enough to guarantee that full-precision interpretability results still hold after compression.

Core claim

The central claim is that perplexity and similar behavioral metrics can miss SAE feature damage under round-to-nearest quantization. On Gemma-2-2B, INT7 improves perplexity while degrading 18.7% of active features; under sliding-window evaluation INT6 also improves perplexity while only 51.3% of active features survive. Feature survival is graded, with 62.4% of active Pythia features and 51.3% of active Gemma features surviving at INT6; most non-surviving features are blurred. Survival is predictable from full-precision feature statistics with cross-validated AUC 0.92-0.97, and RTN quantization and matched-perplexity pruning damage strongly overlapping sets (Jaccard 0.79-0.86).
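To make "round-to-nearest" concrete, the following is a minimal sketch of symmetric per-tensor RTN weight quantization at a chosen bit width. This summary does not state the paper's exact scheme (per-tensor versus per-channel scales, symmetric versus asymmetric grids), so the function name `rtn_quantize` and its scaling rule are assumptions made for illustration only.

```python
def rtn_quantize(weights, bits):
    """Symmetric per-tensor round-to-nearest quantization (illustrative).

    Assumption: a single scale for the whole tensor and a signed integer
    grid [-(2**(bits-1) - 1), 2**(bits-1) - 1]. The paper may differ.
    Returns the dequantized (float) weights.
    """
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(w) for w in weights)
    if max_abs == 0.0:
        return list(weights)
    scale = max_abs / qmax
    out = []
    for w in weights:
        q = round(w / scale)            # nearest grid point (ties go to even)
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        out.append(q * scale)
    return out


# Toy weights; the rounding error is bounded by half a grid step.
weights = [0.91, -0.40, 0.07, -1.27, 0.33]
for bits in (8, 6, 4):
    deq = rtn_quantize(weights, bits)
    err = max(abs(a - b) for a, b in zip(weights, deq))
    print(bits, round(err, 4))
```

The per-weight error is at most half the grid step, and the grid step roughly doubles with each bit removed, which is one way to picture why damage could be graded across bit widths rather than cliff-like.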

What carries the argument

A frozen SAE trained on full-precision activations, used as a fixed basis to encode both full-precision and quantized activations on identical tokens and score per-feature survival by Pearson correlation.
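The measurement can be sketched in a few lines. This is a toy illustration under stated assumptions, not the author's code: the encoder is a plain ReLU layer with random weights, the "quantized" activations are simulated by adding noise to the full-precision ones, and the names `sae_encode`, `feature_survival`, and the cutoff `SURVIVE_THRESH` are all hypothetical.

```python
import math
import random


def pearson(x, y):
    """Pearson correlation of two equal-length sequences; NaN if either is constant."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    if sxx == 0.0 or syy == 0.0:
        return float("nan")
    return sxy / math.sqrt(sxx * syy)


def sae_encode(acts, W_enc, b_enc):
    """Frozen ReLU encoder: returns one feature vector per token in `acts`."""
    return [
        [max(0.0, sum(w * a for w, a in zip(row, tok)) + b)
         for row, b in zip(W_enc, b_enc)]
        for tok in acts
    ]


def feature_survival(acts_fp, acts_q, W_enc, b_enc):
    """Per-feature Pearson r between FP and quantized encodings of the same tokens."""
    f_fp = sae_encode(acts_fp, W_enc, b_enc)
    f_q = sae_encode(acts_q, W_enc, b_enc)
    return [
        pearson([t[j] for t in f_fp], [t[j] for t in f_q])
        for j in range(len(W_enc))
    ]


# Toy data: random stand-ins for residual-stream activations and SAE weights.
rng = random.Random(0)
d_model, n_feat, n_tok = 8, 16, 200
W_enc = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_feat)]
b_enc = [0.0] * n_feat
acts_fp = [[rng.gauss(0, 1) for _ in range(d_model)] for _ in range(n_tok)]
# Stand-in for the quantized model's activations: FP activations plus noise.
acts_q = [[a + rng.gauss(0, 0.3) for a in tok] for tok in acts_fp]

r = feature_survival(acts_fp, acts_q, W_enc, b_enc)
SURVIVE_THRESH = 0.9  # hypothetical cutoff, not taken from the paper
survived = sum(1 for v in r if v >= SURVIVE_THRESH)
print(f"{survived}/{n_feat} features with r >= {SURVIVE_THRESH}")
```

The key design point is that the same encoder weights are applied to both sets of activations, so any drop in correlation reflects a change in the model's activations rather than a change in the measurement basis.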

If this is right

Behavioral metrics alone do not confirm that full-precision interpretability findings transfer to quantized models.
Feature survival under quantization is graded and most non-surviving features are blurred rather than erased.
Feature damage under quantization overlaps strongly with damage under magnitude pruning.
Per-feature survival can be predicted from the original model's activation statistics with high accuracy.
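The overlap statistic mentioned above is simple to compute. The sketch below shows Jaccard overlap between two sets of damaged feature indices; the index sets are made up for illustration and do not come from the paper.

```python
def jaccard(a, b):
    """Jaccard overlap: size of the intersection divided by size of the union."""
    a, b = set(a), set(b)
    if not a and not b:
        return 1.0  # convention chosen here for two empty sets
    return len(a & b) / len(a | b)


# Hypothetical feature indices flagged as damaged under each compression method.
damaged_rtn = {3, 7, 12, 18, 21, 30}
damaged_prune = {3, 7, 12, 18, 25, 30}
print(round(jaccard(damaged_rtn, damaged_prune), 3))  # 5 shared / 7 total -> 0.714
```

A value of 0.79 to 0.86, as reported, means that roughly four in five features in the union of the two damaged sets are damaged by both methods.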

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Interpretability tools built on full-precision models may give misleading results when applied to the quantized versions actually deployed.
Audits of this kind could be extended to other compression methods such as distillation or pruning at scale.
The graded blurring of features suggests that partial rather than total loss of interpretability is the typical outcome.

Load-bearing premise

That Pearson correlation between full-precision and quantized activations, measured through a frozen SAE, captures meaningful survival of the original features.

What would settle it

A case in which perplexity stays within 1% of the full-precision baseline yet fewer than 10% of active SAE features show Pearson correlation below 0.5 after quantization.

Figures

Figures reproduced from arXiv: 2606.03002 by Evan Duan.

**Figure 2.** Cross-scale RTN bit-width sweep. Panel (a) shows Pythia-70M; panel (b) shows Gemma-2-2B.

**Figure 3.** Feature survival by quartile of FP16 feature statistics under RTN INT6. Quartiles are computed...

**Figure 4.** Layer-sensitivity check for Pythia-70M. RTN bit-width sweep repeated at two residual-stream...

Original abstract

Quantization is a standard path to deploying large language models, and quantized models are typically judged acceptable when perplexity or downstream accuracy remains close to the full-precision original. But behavioral parity need not imply feature fidelity: the sparse-autoencoder (SAE) features used to interpret a full-precision model may change after weight rounding. We test this directly by using a frozen SAE as a fixed measurement basis, encoding full-precision and round-to-nearest (RTN) quantized activations on identical tokens, and measuring per-feature survival by Pearson correlation across bit-widths from INT8 to INT4 on Pythia-70M and Gemma-2-2B. Our central finding is that perplexity can miss feature damage: on Gemma-2-2B, INT7 improves perplexity while degrading 18.7% of active SAE features, and under sliding-window evaluation INT6 also improves perplexity while only 51.3% of active features survive. Feature survival is graded rather than cliff-like, with 62.4% of active Pythia features and 51.3% of active Gemma features surviving at INT6; most non-surviving features are blurred rather than fully damaged. Survival is also predictable from full-precision feature statistics alone, with cross-validated AUC 0.92-0.97 and peak activation as the strongest marginal predictor. Finally, RTN quantization and matched-perplexity magnitude pruning damage strongly overlapping feature sets, with Jaccard overlap 0.79-0.86 and damage-score Spearman correlation 0.98. These results show that behavioral metrics alone are insufficient evidence that full-precision interpretability findings transfer to quantized models, motivating feature-level audits of compression.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows perplexity can improve while Pearson correlations on frozen SAE features drop under RTN quantization, but the metric's sensitivity to distribution shifts is a real concern.

The core result here is that on Gemma-2-2B, INT7 quantization improves perplexity yet 18.7% of active SAE features show degraded Pearson correlation with the full-precision version, and at INT6 under sliding window only about half survive. They also report that survival rates are predictable from full-precision stats (AUC 0.92-0.97) and that the affected features overlap heavily with those damaged by magnitude pruning at matched perplexity (Jaccard 0.79-0.86).

The work is straightforward and useful for anyone trying to move SAE-based interpretability to deployed models. Running the same frozen SAE on matched tokens across bit widths gives a clean before-after comparison, and the overlap with pruning is a nice additional observation that ties the finding to existing compression literature.

The main soft spot is the choice of Pearson correlation itself. Quantization changes activation means, variances, and noise, and Pearson is known to drop under those shifts even if the underlying token-wise pattern stays similar. The paper notes that most non-surviving features are "blurred" rather than destroyed, but without a second similarity measure or a check on whether the feature still fires on the same inputs, it's hard to know how much of the reported damage is real versus an artifact of the proxy. The assumption that the full-precision SAE remains a valid fixed basis after quantization is doing heavy lifting.

This is aimed at the mech-interp crowd working on compression. A reader who already worries about whether features survive quantization will get concrete numbers and a clear caution about relying on perplexity alone. It is worth sending to peer review; the empirical setup is simple enough that referees can check the metric concern directly.

Referee Report

1 major / 1 minor

Summary. The manuscript claims that perplexity is insufficient to certify feature fidelity after quantization of LLMs. Using a frozen SAE as a fixed basis, it encodes full-precision and RTN-quantized activations on the same tokens and measures per-feature survival via Pearson correlation. On Gemma-2-2B, INT7 improves perplexity yet degrades 18.7% of active features; under sliding-window evaluation INT6 also improves perplexity while only 51.3% of active features survive. Survival is graded (62.4% Pythia, 51.3% Gemma at INT6), most non-survivors are blurred, survival is predictable from full-precision statistics (AUC 0.92-0.97), and RTN damage overlaps strongly with magnitude pruning (Jaccard 0.79-0.86, Spearman 0.98).

Significance. If the Pearson proxy is accepted, the result is significant: it supplies concrete, cross-model quantitative evidence that behavioral parity can coexist with substantial SAE feature degradation, shows graded rather than cliff-like effects, demonstrates predictability from full-precision statistics, and establishes overlap with an independent compression method. These elements directly support the call for feature-level audits of quantized models.

major comments (1)

[Abstract] Abstract (central results paragraph): the claim that 'perplexity can miss feature damage' and the specific percentages (18.7% degraded at INT7, 51.3% survival at INT6) rest entirely on interpreting Pearson correlation between frozen-SAE encodings of full-precision vs. RTN-quantized activations as a valid measure of feature survival. Quantization systematically shifts activation means, variances, and noise profiles; Pearson correlation is sensitive to these affine changes even when token-wise activation patterns (i.e., semantic role) remain intact. No alternative similarity measure, normalized cosine, or reconstruction-error check is referenced to corroborate the proxy.

minor comments (1)

The abstract is concise and self-contained, but the manuscript should explicitly define 'active features' and state the precise token set and evaluation protocol (standard vs. sliding-window) used for the reported percentages.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. The comment on the Pearson correlation proxy is addressed point-by-point below. We believe the invariance properties of Pearson correlation directly mitigate the stated concern, while we are happy to strengthen the manuscript with additional corroborating measures.

Referee: [Abstract] Abstract (central results paragraph): the claim that 'perplexity can miss feature damage' and the specific percentages (18.7% degraded at INT7, 51.3% survival at INT6) rest entirely on interpreting Pearson correlation between frozen-SAE encodings of full-precision vs. RTN-quantized activations as a valid measure of feature survival. Quantization systematically shifts activation means, variances, and noise profiles; Pearson correlation is sensitive to these affine changes even when token-wise activation patterns (i.e., semantic role) remain intact. No alternative similarity measure, normalized cosine, or reconstruction-error check is referenced to corroborate the proxy.

Authors: We appreciate the referee highlighting the need to justify the proxy. Pearson correlation is in fact invariant to positive affine transformations applied separately to each variable: adding any constant to, or multiplying by any positive scalar, either the full-precision or the quantized activation vector leaves the coefficient unchanged. This directly counters the concern about sensitivity to mean shifts and variance rescaling that may arise from quantization. The measure therefore isolates whether the relative pattern of activation across tokens is preserved (i.e., whether the feature continues to respond proportionally to the same inputs), which aligns with our definition of survival. If semantic roles were preserved up to shift and scaling, correlation would remain high; degradation occurs only when the relationship is disrupted by quantization noise or other distortions. We acknowledge that the original submission does not report alternative metrics. In the revised manuscript we will add (i) cosine similarity between the two activation vectors (also scale-invariant) and (ii) a reconstruction-error comparison on a held-out token set for a random subset of features. These additions will be presented alongside the existing Pearson results and will not change the reported percentages or conclusions.

revision: yes
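The invariance argument in the rebuttal can be checked numerically. The sketch below uses made-up activation values (not data from the paper) to show that Pearson correlation is unchanged by a positive rescale plus a shift of one variable, whereas cosine similarity is unchanged by rescaling only and does move under a shift.

```python
import math


def pearson(x, y):
    """Pearson correlation of two equal-length, non-constant sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def cosine(x, y):
    """Cosine similarity of two equal-length, non-zero sequences."""
    dot = sum(a * b for a, b in zip(x, y))
    return dot / math.sqrt(sum(a * a for a in x) * sum(b * b for b in y))


fp = [0.0, 0.2, 1.5, 0.0, 0.9, 2.4, 0.1, 0.0]   # toy per-token activations
q = [0.1, 0.1, 1.2, 0.3, 1.1, 2.0, 0.0, 0.2]    # toy "quantized" counterpart

q_scaled = [2.5 * v for v in q]                 # positive rescale only
q_affine = [2.5 * v + 4.0 for v in q]           # positive rescale plus shift

print(pearson(fp, q), pearson(fp, q_affine))    # equal up to float rounding
print(cosine(fp, q), cosine(fp, q_scaled))      # equal up to float rounding
print(cosine(fp, q), cosine(fp, q_affine))      # differ: cosine is not shift-invariant
```

So a pure mean shift or variance rescaling cannot by itself lower the Pearson score; what does lower it is a change in the token-wise pattern, including added noise.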

Circularity Check

0 steps flagged

No circularity: empirical metric is externally defined and not self-referential

The paper's central measurement defines feature survival as Pearson correlation between full-precision and quantized activations passed through a frozen SAE trained on the original model. This is a direct, non-fitted proxy applied to held-out tokens; it does not reduce to any self-definition, fitted parameter renamed as prediction, or self-citation chain. The predictability result uses cross-validation on full-precision statistics to forecast the same correlation-based survival label, which is an independent empirical observation rather than a tautology. No uniqueness theorems, ansatzes smuggled via citation, or renaming of known results appear in the provided text. The derivation chain therefore remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The work depends on standard assumptions from mechanistic interpretability regarding the validity of SAEs as feature extractors and correlation as a fidelity measure.

axioms (1)

domain assumption: Pearson correlation across token activations is a suitable metric for per-feature survival under quantization. Central to measuring feature damage in the experiments described.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Do Activation Monitors Survive Model Updates? Benchmarking, Predicting, and Repairing Activation-Monitor Staleness
cs.LG · 2026-06 · unverdicted · novelty 8.0

Fine-tuning updates frequently stale activation monitors for language model safety while quantization does not, with degradation predictable and repairable via label-free realignment.

Reference graph

Works this paper leans on

17 extracted references · 5 linked inside Pith · cited by 1 Pith paper

[1] Reza Bayat, Ali Rahimi-Kalahroudi, Mohammad Pezeshki, Sarath Chandar, and Pascal Vincent. Steering large language model activations in sparse spaces. arXiv preprint arXiv:2503.00177.

[2] Hector Borobia, Elies Seguí-Mas, and Guillermina Tormo-Carbó. How pruning reshapes features: Sparse autoencoder analysis of weight-pruned language models. arXiv preprint arXiv:2603.25325.

[3] Sviatoslav Chalnev, Matthew Siu, and Arthur Conmy. Improving steering vectors by targeting sparse autoencoder features. arXiv preprint arXiv:2411.02193.

[4] David Chanin and Adrià Garriga-Alonso. Sparse but wrong: Incorrect l0 leads to incorrect features in sparse autoencoders. arXiv preprint arXiv:2508.16560.

[5] Hoagy Cunningham, Aidan Ewart, Logan Riggs, Robert Huben, and Lee Sharkey. Sparse autoencoders find highly interpretable features in language models. arXiv preprint arXiv:2309.08600.

[6] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, Shauna Kravec, Zac Hatfield-Dodds, Robert Lasenby, Dawn Drain, Carol Chen, et al. Toy models of superposition. arXiv preprint arXiv:2209.10652.

[7] Elias Frantar, Saleh Ashkboos, Torsten Hoefler, and Dan Alistarh. GPTQ: Accurate post-training quantization for generative pre-trained transformers. arXiv preprint arXiv:2210.17323.

[8] Gemma Team, Thomas Mesnard, Cassidy Hardin, Robert Dadashi, Surya Bhupatiraju, Shreya Pathak, Laurent Sifre, Morgane Rivière, Mihir Sanjay Kale, Juliette Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295.

[9] Piotr Jedryszek and Oliver M. Crook. Stable and steerable sparse autoencoders with weight regularization. arXiv preprint arXiv:2603.04198.

[10] Shiyao Li, Xuefei Ning, Luning Wang, Tengxuan Liu, Xiangsheng Shi, Shengen Yan, Guohao Dai, Huazhong Yang, and Yu Wang. Evaluating quantized large language models. arXiv preprint arXiv:2402.18158.

[11] Luke Marks, Alasdair Paren, David Krueger, and Fazl Barez. Enhancing neural network interpretability with feature-aligned sparse autoencoders. arXiv preprint arXiv:2411.01220.

[12] Cristina P. Martin-Linares and Jonathan P. Ling. Attribution-guided distillation of matryoshka sparse autoencoders. arXiv preprint arXiv:2512.24975.

[13] Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders. arXiv preprint arXiv:2411.11296.

[14] Gonçalo Paulo and Nora Belrose. Sparse autoencoders trained on the same data learn different features. arXiv preprint arXiv:2501.16615.

[15] Guangxuan Xiao, Ji Lin, Mickael Seznec, Hao Wu, Julien Demouth, and Song Han. SmoothQuant: Accurate and efficient post-training quantization for large language models. In International Conference on Machine Learning, pp. 38087–38099. PMLR.
[16] For each window, previous tokens served as context and only newly introduced target positions were included in the loss. The aggregate perplexity was computed as a token-weighted mean negative log-likelihood:

$$\mathrm{PPL} = \exp\left(\frac{\sum_w m_w L_w}{\sum_w m_w}\right), \tag{8}$$

where $m_w$ is the number of scored tokens in window $w$ and $L_w$ is the mean loss over those scored tokens. For the quantize...
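The token-weighted aggregate in the excerpt above can be sketched as follows. The function name `aggregate_perplexity` and the toy window losses are illustrative assumptions; only the weighting rule itself is taken from the excerpt.

```python
import math


def aggregate_perplexity(window_losses, window_counts):
    """Token-weighted perplexity: exp(sum_w m_w * L_w / sum_w m_w).

    window_losses[w] is the mean loss L_w over the scored tokens of window w,
    window_counts[w] is the number of scored tokens m_w in that window.
    """
    total_tokens = sum(window_counts)
    if total_tokens == 0:
        raise ValueError("no scored tokens")
    weighted = sum(m * loss for loss, m in zip(window_losses, window_counts))
    return math.exp(weighted / total_tokens)


# Toy example: a long first window followed by windows that score only the
# newly introduced positions (counts and losses are made up).
losses = [3.0, 4.0, 4.0]
counts = [2048, 512, 512]
print(aggregate_perplexity(losses, counts))
```

Weighting by scored-token count keeps short windows from being over-represented, which a plain average of per-window perplexities would not do.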
[17] Condition | Chunked PPL | Chunked delta (%) | Sliding PPL | Sliding delta (%) | Survived (%) | Degraded (%) | Damaged (%) | PPL ratio | Median corr

| Condition | Bits | Loss | PPL | Total tokens | Loss tokens | Window | Stride | Quantized tensors | Quantized params | PPL delta (%) |
|---|---|---|---|---|---|---|---|---|---|---|
| FP16 baseline | 16 | 3.8372 | 46.39 | 288,894 | 288,893 | 2048 | 512 | 0 | 0 | 0.00 |
| RTN INT8 | 8 | 3.8454 | 46.78 | 288,894 | 288,893 | 2048 | 512 | 182 | 2,024,275,968 | 0.82 |
| RTN INT7 | 7 | 3.7700 | 43.38 | 288,894 | 288,893 | 2048 | 512 | 182 | 2,024,275,968 | -6.49 |
| RTN INT6 | 6 | 3.8020 | 44.79 | 288,894 | 288,... | | | | | |
