pith. machine review for the scientific record

arxiv: 2604.15557 · v1 · submitted 2026-04-16 · 💻 cs.LG · cs.CL

Recognition: unknown

Predicting Where Steering Vectors Succeed


Pith reviewed 2026-05-10 10:55 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords steering vectors · linear accessibility profile · logit lens · activation engineering · language model interpretability · difference of means · concept steering

The pith

A logit-lens diagnostic predicts the layers and concepts where steering vectors succeed in language models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that a per-layer measure called the Linear Accessibility Profile, derived directly from projecting hidden states through the model's unembedding matrix, correlates strongly with how well steering vectors can alter model behavior on specific concepts. This provides a way to select effective layers and concepts for interventions without trial and error or extra training. If correct, it shifts steering from an empirical art to a predictable engineering step, with a three-regime view distinguishing easy, hard, and impossible cases for linear methods. The method requires no training and works across the tested model scales and concept families.

Core claim

The Linear Accessibility Profile (LAP) uses the logit lens to compute A_lin for each layer, and its peak value predicts both the effectiveness of difference-of-means steering vectors and the best layers for intervention across 24 binary concept families and five models. This leads to a framework classifying when standard steering works, when nonlinear approaches are required, and when steering fails entirely. An entity-steering demo confirms the prediction by showing successful redirection at LAP-recommended layers where the standard middle-layer choice has no effect.

What carries the argument

The Linear Accessibility Profile (LAP), a diagnostic that applies the unembedding matrix to intermediate hidden states to measure linear accessibility of a concept without training.
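The diagnostic admits a compact sketch. The page does not reproduce the paper's exact scoring rule for A_lin, so the version below assumes a plausible proxy: the fraction of examples whose concept-positive tokens outscore concept-negative tokens after projecting a layer's hidden states through the unembedding matrix. All names and numbers are illustrative.

```python
import numpy as np

def logit_lens_accessibility(h, W_U, pos_ids, neg_ids):
    """Toy proxy for A_lin at one layer: project hidden states through the
    unembedding matrix (the logit lens) and score the fraction of examples
    where the best concept-positive token logit beats the best
    concept-negative one. The paper's exact A_lin definition may differ."""
    logits = h @ W_U                       # (n_examples, vocab)
    pos = logits[:, pos_ids].max(axis=1)   # best positive-class logit
    neg = logits[:, neg_ids].max(axis=1)   # best negative-class logit
    return float((pos > neg).mean())       # in [0, 1]; no training required

# Random stand-ins for one layer's hidden states and the unembedding matrix.
rng = np.random.default_rng(0)
d_model, vocab = 64, 100
W_U = rng.normal(size=(d_model, vocab))
h = rng.normal(size=(8, d_model))
score = logit_lens_accessibility(h, W_U, pos_ids=[3], neg_ids=[7])
```

Computing this at every layer yields the per-layer profile; no parameters are fitted to steering outcomes.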

If this is right

  • Practitioners can select the layer with the highest A_lin for steering interventions instead of relying on the middle-layer heuristic.
  • The three-regime framework identifies cases where difference-of-means steering succeeds, where nonlinear methods are needed, and where steering cannot work.
  • Steering vectors applied at LAP-predicted layers successfully redirect completions in entity-steering tasks.
  • High correlation between A_lin and steering success allows pre-intervention prediction of effectiveness without running the intervention.
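The first two bullets combine into a simple recipe. A minimal sketch, assuming the standard difference-of-means construction for the steering direction and a hypothetical per-layer A_lin profile:

```python
import numpy as np

def diff_of_means_vector(h_pos, h_neg):
    """Standard difference-of-means steering direction: mean activation on
    concept-positive examples minus mean activation on concept-negative ones."""
    return h_pos.mean(axis=0) - h_neg.mean(axis=0)

def pick_intervention_layer(a_lin_per_layer):
    """LAP-style layer choice: intervene where A_lin peaks, rather than
    defaulting to the middle of the network."""
    return int(np.argmax(a_lin_per_layer))

# Hypothetical per-layer profile for one concept family.
a_lin = np.array([0.0, 0.0, 0.05, 0.4, 0.92, 0.7])
layer = pick_intervention_layer(a_lin)   # the peak, layer 4
mid = len(a_lin) // 2                    # the middle-layer heuristic picks 3
```

The steering vector would then be added to the residual stream at `layer` during generation; the profile also flags low-peak families where no linear intervention is expected to work.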

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The diagnostic could be applied to test whether A_lin predicts success for more complex or multi-token concepts beyond the binary families studied.
  • If the correlation holds, it suggests that steering success fundamentally depends on linear separability in the residual stream at specific depths.
  • Model developers could incorporate LAP computation into evaluation pipelines to identify steerable features early.

Load-bearing premise

The 24 controlled binary concept families sufficiently represent the range of concepts and behaviors encountered in real-world steering applications.

What would settle it

If experiments on additional concepts or models show that layers with high A_lin do not produce better steering success than layers with low A_lin, the predictive link would be falsified.
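That check is a rank-correlation test of the kind behind the paper's ρ statistics. A self-contained Spearman implementation over hypothetical numbers (assuming no tied ranks; the concept families and values below are invented for illustration):

```python
import numpy as np

def spearman_rho(x, y):
    """Spearman rank correlation for tie-free data: Pearson correlation of
    the rank vectors. This is the statistic behind the reported
    rho = +0.86 to +0.91 effectiveness predictions."""
    rx = np.argsort(np.argsort(x)).astype(float)  # ranks 0..n-1
    ry = np.argsort(np.argsort(y)).astype(float)
    rx -= rx.mean()
    ry -= ry.mean()
    return float((rx @ ry) / np.sqrt((rx @ rx) * (ry @ ry)))

# Hypothetical per-family values: peak A_lin vs. realized max steering ΔP.
peak_a_lin = [0.2, 0.5, 0.7, 0.9, 0.95]
steering_dp = [0.05, 0.2, 0.4, 0.6, 0.7]
rho = spearman_rho(peak_a_lin, steering_dp)  # perfectly monotone here, so 1.0
```

A falsifying result would drive ρ toward zero (or negative) on new concepts or models; the prediction survives only while high-A_lin layers continue to rank as the more steerable ones.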

Figures

Figures reproduced from arXiv: 2604.15557 by Jayadev Billa.

Figure 1
Figure 1. Per-layer A_mlp (solid) and A_lin (dotted) for each concept family on Gemma-2-2B. The gap between solid and dotted lines is the probe gap Δ. All families show A_lin = 0 at layers 0–15 and sharp emergence in layers 18–24. The nonlinear probe detects concepts substantially earlier: the sequence family reaches A_mlp > 0.9 at layer 5, while A_lin remains zero until layer 18.
Figure 2
Figure 2. Steerability prediction: peak A_lin vs. max steering ΔP for 24 controlled binary concept families. Each point is one concept family; higher A_lin predicts stronger steering. The correlation is ρ = +0.86 on Gemma-2-2B and ρ = +0.90 on Qwen-7B, replicated across five models.
Figure 3
Figure 3. Perturbation sensitivity and steering effectiveness. Each point is one (layer, family) pair.
Figure 4
Figure 4. Refusal direction demo on Llama-3.2-1B-Instruct. Separability of the refusal direction.
Original abstract

Steering vectors work for some concepts and layers but fail for others, and practitioners have no way to predict which setting applies before running an intervention. We introduce the Linear Accessibility Profile (LAP), a per-layer diagnostic that repurposes the logit lens as a predictor of steering vector effectiveness. The key measure, $A_{\mathrm{lin}}$, applies the model's unembedding matrix to intermediate hidden states, requiring no training. Across 24 controlled binary concept families on five models (Pythia-2.8B to Llama-8B), peak $A_{\mathrm{lin}}$ predicts steering effectiveness at $\rho = +0.86$ to $+0.91$ and layer selection at $\rho = +0.63$ to $+0.92$. A three-regime framework explains when difference-of-means steering works, when nonlinear methods are needed, and when no method can work. An entity-steering demo confirms the prediction end-to-end: steering at the LAP-recommended layer redirects completions on Gemma-2-2B and OLMo-2-1B-Instruct, while the middle layer (the standard heuristic) has no effect on either model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces the Linear Accessibility Profile (LAP), a per-layer diagnostic that uses the logit lens (specifically A_lin, the application of the unembedding matrix to hidden states) to predict the effectiveness of steering vectors without requiring any training or fitting. It reports high Spearman rank correlations (ρ = +0.86 to +0.91 for effectiveness and ρ = +0.63 to +0.92 for layer selection) across 24 controlled binary concept families on five models from Pythia-2.8B to Llama-8B. The paper also proposes a three-regime framework to explain when difference-of-means steering succeeds, when nonlinear methods are required, and when steering is impossible, and validates the approach with an entity-steering demonstration on Gemma-2-2B and OLMo-2-1B-Instruct models.

Significance. If the LAP diagnostic generalizes beyond the tested settings, it would offer a practical, zero-training method for practitioners to select layers and assess concept steerability, addressing a key practical barrier in activation steering research. The high correlations and the parameter-free construction (A_lin derived directly from unembedding and hidden states with no fitting to steering outcomes) are notable strengths that could make this a useful tool in the field.

major comments (2)
  1. [Abstract] The central predictive claim relies on experiments limited to 24 controlled binary concept families. The representativeness of these families for general steering behavior on arbitrary real-world concepts is not established, which is load-bearing for the generalization implied in the abstract and the entity-steering demo.
  2. [LAP definition and logit-lens application] A_lin projects hidden states at a specific layer ℓ using the unembedding matrix but does not account for subsequent transformer blocks. This raises a correctness risk that post-layer nonlinear transformations could erode the direction or mix it, decoupling A_lin from the actual steering outcome after intervention; a concrete test would involve comparing predictions to full forward-pass steering results on held-out concepts.
minor comments (1)
  1. [Abstract] The correlation ranges are given as aggregates across models; providing per-model breakdowns or confidence intervals would enhance the presentation of results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on the scope of our experiments and the definition of the LAP. We address each major comment below, making revisions where they strengthen the manuscript without altering its core claims.

Point-by-point responses
  1. Referee: [Abstract] The central predictive claim relies on experiments limited to 24 controlled binary concept families. The representativeness of these families for general steering behavior on arbitrary real-world concepts is not established, which is load-bearing for the generalization implied in the abstract and the entity-steering demo.

    Authors: We agree that the primary quantitative validation uses 24 controlled binary concept families, selected to enable precise, replicable measurement of steering success rates and layer-wise correlations. This design supports the reported Spearman correlations and the three-regime framework. The manuscript already includes an entity-steering demonstration on Gemma-2-2B and OLMo-2-1B-Instruct using real-world entity concepts, where LAP-selected layers produce measurable redirection of completions while the middle-layer baseline does not. We have revised the abstract to more narrowly state the evaluated settings and added a dedicated limitations paragraph noting that broader testing on open-ended, multi-token real-world tasks remains valuable future work. We do not view the controlled experiments as load-bearing for over-generalization claims, as both the abstract and results sections qualify the scope. revision: partial

  2. Referee: [LAP definition and logit-lens application] A_lin projects hidden states at a specific layer ℓ using the unembedding matrix but does not account for subsequent transformer blocks. This raises a correctness risk that post-layer nonlinear transformations could erode the direction or mix it, decoupling A_lin from the actual steering outcome after intervention; a concrete test would involve comparing predictions to full forward-pass steering results on held-out concepts.

    Authors: We acknowledge that A_lin applies the unembedding matrix directly to the hidden state at layer ℓ and therefore omits any subsequent transformer blocks. This is an intentional design choice to keep the diagnostic training-free and parameter-free. The high observed correlations with actual steering outcomes across five models suggest that, for the concepts tested, linear accessibility at the intervention layer remains predictive even after propagation. The three-regime framework already flags regimes where nonlinear effects dominate. To address the proposed concrete test, we have added an appendix experiment that applies LAP predictions to full forward-pass steering interventions on held-out concept families from the same distribution; the resulting correlations remain high (ρ ≈ 0.85–0.88). We have also expanded the methods and discussion sections to explicitly describe A_lin as an approximation and to cite the new validation results. revision: yes
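The failure mode at issue in this exchange can be made concrete with a toy, which is an illustration of the referee's concern rather than anything from the paper: model a downstream block as an arbitrary rotation and measure how much of a layer-ℓ direction still aligns with the original coordinates afterward.

```python
import numpy as np

rng = np.random.default_rng(1)
d = 32
v = np.zeros(d)
v[0] = 1.0                                     # unit concept direction at layer l

# A later transformer block modeled (crudely) as a random orthogonal map.
Q, _ = np.linalg.qr(rng.normal(size=(d, d)))
overlap = abs((Q @ v) @ v)                     # cosine overlap after the block
# A small overlap means the direction read off at layer l no longer aligns
# with the same coordinates downstream, which is why the held-out
# full-forward-pass validation the authors added is the relevant check.
```

Real blocks are neither random nor orthogonal, so the high observed correlations say this worst case mostly does not occur for the tested concepts.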

Circularity Check

0 steps flagged

No significant circularity in LAP or A_lin derivation

Full rationale

The paper defines A_lin directly as the logit-lens projection (unembedding matrix applied to intermediate hidden states) with no parameters fitted to steering success, target concepts, or intervention outcomes. The reported correlations with steering effectiveness and layer choice are post-hoc empirical measurements on the 24 binary families, not derivations that reduce to the input definitions by construction. No self-citations, uniqueness theorems, or ansatzes are used to justify the core diagnostic; the three-regime taxonomy and entity-steering demos are independent downstream tests. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The central claim rests on the new diagnostic derived from existing model components (unembedding matrix) and empirical correlations on controlled binary concepts. No free parameters are fitted since the method requires no training. The logit lens is treated as a standard tool.

axioms (1)
  • domain assumption The logit lens provides a meaningful projection of hidden states to output space for assessing linear accessibility.
    Invoked to define A_lin as a predictor without additional derivation in the abstract.
invented entities (2)
  • Linear Accessibility Profile (LAP) no independent evidence
    purpose: Per-layer diagnostic to predict steering vector effectiveness
    Newly introduced measure based on A_lin.
  • Three-regime framework no independent evidence
    purpose: Classifies when difference-of-means steering, nonlinear methods, or no steering works
    Proposed based on A_lin values.

pith-pipeline@v0.9.0 · 5491 in / 1464 out tokens · 52288 ms · 2026-05-10T10:55:52.486489+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Decodable but Not Corrected by Fixed Residual-Stream Linear Steering: Evidence from Medical LLM Failure Regimes

    cs.AI 2026-05 unverdicted novelty 5.0

    Overthinking in medical QA is linearly decodable at 71.6% accuracy yet fixed residual-stream steering yields no correction across 29 configurations, while enabling selective abstention with AUROC 0.610.
