CorrSteer: Generation-Time LLM Steering via Correlated Sparse Autoencoder Features

Adriano Koshiyama; Seonglae Cho; Zekun Wu

arxiv: 2508.12535 · v3 · submitted 2025-08-18 · cs.CL · cs.AI · cs.LG

Seonglae Cho, Zekun Wu, Adriano Koshiyama

Pith reviewed 2026-05-18 23:04 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords: sparse autoencoders, LLM steering, feature selection, inference time, correlation, jailbreak prevention, bias mitigation, reasoning benchmarks

The pith

CorrSteer selects SAE features by correlating correctness with inference-time activations to steer LLMs without contrastive datasets.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sparse autoencoders can pull out interpretable features from language models, but using them to steer outputs usually demands contrastive datasets or storing huge numbers of activations. CorrSteer instead records the activations that appear while the model generates answers on the fly and keeps only the features whose strength lines up with whether the answer was right or wrong. It then uses the average strength of those features as the steering amount, so the whole process runs from ordinary inference data. This yields better results on question answering, reducing bias, blocking jailbreaks, and reasoning tests for Gemma-2 2B and LLaMA-3.1 8B.

Core claim

By correlating the correctness of generated samples with the activations of sparse autoencoder features at the time of generation, CorrSteer identifies steering directions that are more relevant to the task at hand. This correlation-based selection, combined with deriving steering coefficients from average activations, automates the pipeline and yields measurable gains on QA, bias mitigation, jailbreak prevention, and reasoning benchmarks.
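The selection step described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `select_features_by_correlation`, the `(n_samples, n_features)` layout of pooled activations, and the toy data are all assumptions. Taking the coefficient as the plain mean over all samples is also an assumption; the paper may restrict the average to a subset of samples.

```python
import numpy as np

def select_features_by_correlation(acts, correct, top_k=1):
    """Rank SAE features by Pearson correlation with correctness.

    acts:    (n_samples, n_features) SAE activations, one pooled row per
             generated answer (hypothetical layout)
    correct: (n_samples,) 1.0 if the generated answer was right, else 0.0
    """
    acts = np.asarray(acts, dtype=float)
    y = np.asarray(correct, dtype=float)
    z_centered = acts - acts.mean(axis=0)
    y_centered = y - y.mean()
    cov = z_centered.T @ y_centered / len(y)       # Cov(z_i, y) per feature
    denom = acts.std(axis=0) * y.std()             # sqrt(Var(z_i) * Var(y))
    # Features that never vary (or constant labels) get r = 0 instead of NaN.
    r = np.divide(cov, denom, out=np.zeros_like(cov), where=denom > 0)
    top = np.argsort(-r)[:top_k]                   # most positively correlated first
    return top, r

# Toy data (fabricated for illustration): feature 3 tracks correctness.
rng = np.random.default_rng(0)
n = 200
correct = rng.integers(0, 2, size=n).astype(float)
acts = rng.random((n, 5))
acts[:, 3] = np.maximum(0.0, correct + 0.1 * rng.standard_normal(n))

top, r = select_features_by_correlation(acts, correct, top_k=1)
coeff = acts[:, top].mean(axis=0)  # assumed: coefficient = mean activation over all samples
```

Ranking by signed correlation keeps only features that rise with correctness; ranking by absolute value instead would also surface features to suppress, which is a design choice the summary above does not settle.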

What carries the argument

Correlation of sample correctness with inference-time SAE activations to select features, with steering coefficients taken from their average activation values.
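A passage quoted further down this page gives the steering vector as v_steer = c_i · W_dec[:, i], added to the residual stream. The sketch below shows what that addition could look like on plain arrays. The function name `steer_residual`, the tensor shapes, and the choice to steer every token position are assumptions made for illustration; a real pipeline would apply this inside a model hook.

```python
import numpy as np

def steer_residual(resid, W_dec, feature_idx, coeff):
    """Add coeff * W_dec[:, feature_idx] to every position of a residual stream.

    resid: (seq_len, d_model) residual activations (hypothetical layout)
    W_dec: (d_model, n_features) SAE decoder matrix, one column per feature
    """
    v_steer = coeff * W_dec[:, feature_idx]  # (d_model,)
    return resid + v_steer                   # broadcasts over positions

# Toy shapes and values, fabricated for illustration.
rng = np.random.default_rng(0)
d_model, n_features, seq_len = 8, 16, 4
W_dec = rng.standard_normal((d_model, n_features))
resid = rng.standard_normal((seq_len, d_model))

feature_acts = np.array([0.0, 1.2, 0.8, 2.0])  # toy activations of feature 5
coeff = feature_acts.mean()                    # coefficient from the average activation
steered = steer_residual(resid, W_dec, 5, coeff)
```

Because the coefficient comes from observed activations, the shift stays on the scale the feature normally takes, which is presumably what lets the pipeline skip a manual search over steering strength.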

If this is right

Raises MMLU performance by 3.3% with 4000 samples.
Improves HarmBench performance by 27.2% with only 108 samples.
Produces features whose meanings match what each task needs.
Works across QA, bias mitigation, jailbreak prevention, and reasoning without contrastive data.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could make steering practical in settings where contrastive pairs are hard to collect.
Similar correlation checks might help find useful internal signals in other neural network components beyond SAEs.
The same selection step could be tried on tasks such as code generation or summarization to test broader applicability.

Load-bearing premise

Features that line up with correctness on the evaluation samples will actually cause better steering on new generations instead of reflecting quirks of that particular test set.

What would settle it

Applying the selected features to steer a fresh set of tasks or models and finding no improvement over random feature selection or no steering at all would show the correlations are not useful beyond the original samples.

Original abstract

Sparse Autoencoders (SAEs) can extract interpretable features from large language models (LLMs) without supervision. However, their effectiveness in downstream steering tasks is limited by the requirement for contrastive datasets or large activation storage. To address these limitations, we propose CorrSteer, which selects features by correlating sample correctness with SAE activations from generated tokens at inference time. This approach uses only inference-time activations to extract more relevant features, thereby reducing spurious correlations. It also obtains steering coefficients from average activations, automating the entire pipeline. Our method shows improved task performance on QA, bias mitigation, jailbreaking prevention, and reasoning benchmarks on Gemma-2 2B and LLaMA-3.1 8B, notably achieving a +3.3% improvement in MMLU performance with 4000 samples and a +27.2% improvement in HarmBench with only 108 samples. Selected features demonstrate semantically meaningful patterns aligned with each task's requirements, revealing the underlying capabilities that drive performance. Our work establishes correlation-based selection as an effective and scalable approach for automated SAE steering across language model applications.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces CorrSteer, a method for generation-time steering of LLMs via sparse autoencoders. Features are selected by computing correlations between sample correctness labels and SAE activations on tokens generated at inference time; steering coefficients are then derived from average activations. This is claimed to automate the pipeline, reduce spurious correlations relative to contrastive approaches, and yield performance gains on QA, bias mitigation, jailbreak prevention, and reasoning tasks. Specific results include +3.3% MMLU on Gemma-2 2B with 4000 samples and +27.2% HarmBench on LLaMA-3.1 8B with 108 samples, with selected features described as semantically aligned with task requirements.

Significance. If the central empirical claims hold after addressing evaluation-protocol concerns, the work would provide a scalable, low-overhead alternative to contrastive or supervised SAE steering that requires neither paired datasets nor extensive activation storage. The inference-time correlation approach and automation of coefficient selection are practical strengths that could facilitate wider adoption of feature-based control in interpretability research. The reported gains across two model scales and four task categories, together with qualitative feature analysis, would strengthen the case for correlation-based selection as a general technique.

major comments (2)

[§3.2 and §4.1] §3.2 (Feature Selection) and §4.1 (Experimental Setup): Feature selection is performed by correlating correctness labels with SAE activations on generated tokens from the evaluation samples whose downstream performance is later reported (e.g., the 4000-sample MMLU set and 108-sample HarmBench set). No explicit hold-out split, permutation test on labels, or control for distribution shift between selection and steering phases is described. This overlap is load-bearing for the claim that observed gains (+3.3% MMLU, +27.2% HarmBench) reflect causal steering rather than spurious correlations fitted to the evaluation distribution.
[§4.2–4.4] §4.2–4.4 (Results tables): The reported improvements are presented without accompanying statistical tests, standard errors across random seeds, or ablation against strong non-SAE baselines (e.g., prompt engineering or activation addition with random features). For the MMLU and HarmBench numbers to support the central claim of reliable task improvement, variance estimates and significance levels are required.
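The permutation test requested in the first major comment could take roughly the following form. This is a sketch under stated assumptions, not a procedure from the paper: the statistic is the largest absolute correlation over all features, so the null distribution accounts for the fact that the best feature was picked after looking at the labels. The names `max_abs_corr` and `permutation_p_value`, the number of permutations, and the toy data are hypothetical.

```python
import numpy as np

def max_abs_corr(acts, y):
    """Largest |Pearson r| between any feature column and the labels."""
    zc = acts - acts.mean(axis=0)
    yc = y - y.mean()
    denom = np.sqrt((zc ** 2).sum(axis=0) * (yc ** 2).sum())
    r = np.divide(zc.T @ yc, denom, out=np.zeros(acts.shape[1]), where=denom > 0)
    return np.abs(r).max()

def permutation_p_value(acts, y, n_perm=500, seed=0):
    """Share of label shuffles whose best feature matches or beats the observed one."""
    rng = np.random.default_rng(seed)
    observed = max_abs_corr(acts, y)
    null = np.array([max_abs_corr(acts, rng.permutation(y)) for _ in range(n_perm)])
    return (1 + np.sum(null >= observed)) / (1 + n_perm)

# Toy data sized like the 108-sample set mentioned above; values are fabricated.
rng = np.random.default_rng(1)
n, k = 108, 50
y = rng.integers(0, 2, size=n).astype(float)
noise_acts = rng.random((n, k))                       # no feature is informative
signal_acts = noise_acts.copy()
signal_acts[:, 0] = y + 0.3 * rng.standard_normal(n)  # one informative feature

p_signal = permutation_p_value(signal_acts, y)
p_noise = permutation_p_value(noise_acts, y)
```

Using the maximum over features rather than a single feature's correlation matters most for small sets: with many candidate features and about a hundred samples, some feature will correlate with the labels by chance.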

minor comments (2)

[§3] Notation for the correlation coefficient and the steering multiplier should be defined once in §3 and used consistently; current usage mixes r and α without a central equation.
[Figure 3] Figure 3 (feature visualization) would benefit from a side-by-side comparison of top activating tokens before and after steering to illustrate the claimed semantic alignment.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The two major comments raise valid methodological concerns regarding potential data overlap and statistical rigor. We address each point below and have incorporated revisions to strengthen the claims.

Point-by-point responses

Referee: [§3.2 and §4.1] §3.2 (Feature Selection) and §4.1 (Experimental Setup): Feature selection is performed by correlating correctness labels with SAE activations on generated tokens from the evaluation samples whose downstream performance is later reported (e.g., the 4000-sample MMLU set and 108-sample HarmBench set). No explicit hold-out split, permutation test on labels, or control for distribution shift between selection and steering phases is described. This overlap is load-bearing for the claim that observed gains (+3.3% MMLU, +27.2% HarmBench) reflect causal steering rather than spurious correlations fitted to the evaluation distribution.

Authors: We acknowledge that feature selection was performed on the same samples used for downstream evaluation, which introduces a risk of fitting to the test distribution. This was an oversight in the original experimental design, particularly for smaller sets like the 108-sample HarmBench evaluation. To correct this, we have revised the pipeline to use an explicit hold-out split (80% for feature selection and correlation computation, 20% for steering and performance measurement) and now report results under this protocol. We have also added permutation tests on the correctness labels to quantify the significance of selected correlations beyond chance. These changes preserve the core inference-time correlation approach while providing stronger evidence against spurious fitting.
revision: yes

Referee: [§4.2–4.4] §4.2–4.4 (Results tables): The reported improvements are presented without accompanying statistical tests, standard errors across random seeds, or ablation against strong non-SAE baselines (e.g., prompt engineering or activation addition with random features). For the MMLU and HarmBench numbers to support the central claim of reliable task improvement, variance estimates and significance levels are required.

Authors: We agree that the results section would be strengthened by explicit statistical analysis and additional controls. In the revised manuscript, we now report standard errors computed over multiple random seeds for both feature selection and steering coefficient estimation. We include paired statistical tests (e.g., t-tests) comparing steered performance against the unsteered baseline, with p-values for the key gains on MMLU and HarmBench. We have further added ablations against prompt engineering and random-feature activation addition baselines, confirming that the observed improvements exceed those from these simpler methods. These updates directly address the need for variance estimates and significance levels.
revision: yes
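The protocol in the simulated responses (an 80/20 split between selection and evaluation, plus standard errors over seeds) can be sketched as follows. The helper names `holdout_split` and `mean_and_standard_error` are hypothetical, and the per-seed gains are placeholder numbers chosen only to exercise the arithmetic; they are not results from the paper or the rebuttal.

```python
import numpy as np

def holdout_split(n_samples, select_frac=0.8, seed=0):
    """Shuffle sample indices and split them into selection and evaluation sets."""
    rng = np.random.default_rng(seed)
    perm = rng.permutation(n_samples)
    cut = int(round(select_frac * n_samples))
    return perm[:cut], perm[cut:]

def mean_and_standard_error(values):
    """Mean and standard error (sample std / sqrt(n)) of per-seed measurements."""
    values = np.asarray(values, dtype=float)
    return values.mean(), values.std(ddof=1) / np.sqrt(len(values))

# 108 matches the smaller set discussed above; features would be selected on
# select_idx only, and steering would be scored on eval_idx only.
select_idx, eval_idx = holdout_split(108)

# Placeholder per-seed accuracy gains (made up, not reported results).
gains = [0.02, 0.04, 0.03, 0.01, 0.05]
mean_gain, se = mean_and_standard_error(gains)
```

With 108 samples the evaluation side of an 80/20 split holds only 22 examples, so per-seed variance is likely to be large, and repeating the split over several seeds is what makes the standard error meaningful.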

Circularity Check

0 steps flagged

No significant circularity; empirical method with independent evaluation

full rationale

The paper describes CorrSteer as selecting SAE features via correlation between sample correctness labels and inference-time activations, then deriving steering coefficients from average activations to apply at generation time. Performance gains (e.g., +3.3% MMLU, +27.2% HarmBench) are reported as empirical outcomes on benchmarks after this selection. No equations or steps are shown that reduce the claimed improvements to a quantity defined purely in terms of the fitted correlations or self-referential definitions. The central claim remains an empirical demonstration of improved task performance rather than a derivation forced by construction from its inputs. No self-citation chains, uniqueness theorems, or ansatz smuggling appear load-bearing in the provided description.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The method rests on the established domain assumption that SAE features are sufficiently monosemantic to support steering; no new mathematical axioms or invented entities are introduced in the abstract description.

free parameters (1)

sample count for correlation
Task-specific numbers of samples (4000 for MMLU, 108 for HarmBench) are used to compute correlations; these are experimental choices rather than derived quantities.

axioms (1)

domain assumption: SAE features extracted from LLM activations correspond to interpretable and steerable directions that influence generation behavior
The correlation step assumes that higher correlation with correctness identifies features whose activation can be adjusted to improve downstream task performance.

pith-pipeline@v0.9.0 · 5735 in / 1376 out tokens · 62974 ms · 2026-05-18T23:04:56.184147+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

selects features by correlating sample correctness with SAE activations from generated tokens at inference time... r_i = Cov(z_i, y) / sqrt(Var(z_i) · Var(y))

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

steering vector v_steer = c_i · W_dec[:, i] added to residual stream

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Steering Without Breaking: Mechanistically Informed Interventions for Discrete Diffusion Language Models
cs.LG 2026-05 unverdicted novelty 8.0

Adaptive scheduling of interventions in discrete diffusion language models, timed to attribute-specific commitment schedules discovered with sparse autoencoders, delivers precise multi-attribute steering up to 93% str...
Steered Generation via Gradient-Based Optimization on Sparse Query Features
cs.LG 2026-05 unverdicted novelty 5.0

Prototype-Based Sparse Steering decomposes query activations with SAEs and optimizes sparse features via gradients to steer LLM outputs toward specific behaviors.