When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

Haris Vikalo; Jianing Zhu; John T. Robertson; Zhangyang Wang

arxiv: 2605.16362 · v2 · pith:BM7KFQSMnew · submitted 2026-05-09 · 💻 cs.LG · cs.AI

Pith reviewed 2026-05-22 09:55 UTC · model grok-4.3

keywords: activation steering · rank-1 intervention · LLM control · directional alignment · concept granularity · activation geometry · budgeted optimization

The pith

Activation geometry turns the search for effective rank-1 steering directions into a guided process that recovers high utility with far fewer trials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the inconsistent success of single-direction activation steering across concepts in large language models stems mainly from the expense of searching over layers and scales, not from the absence of a useful direction. The authors treat steering as a budgeted optimization problem and show that prompt-boundary directional alignment is a reliable predictor of promising intervention sites. This prior enables a geometry-guided search procedure that reaches 95 percent of the best-found utility with 39.8 percent fewer evaluations on average. The authors also introduce concept granularity to quantify how much the best direction varies across inputs, and show that higher granularity predicts both slower convergence and lower peak performance. These observations support a framework that diagnoses the dominant source of difficulty and allocates search effort accordingly.
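
One way to write this framing down, in notation assumed here for illustration (the symbols $v_\ell$, $U$, $\mathcal{L}$, $\mathcal{A}$, and $B$ are not taken from the paper): a rank-1 intervention adds a scaled direction to the hidden state at one layer, and the search chooses which (layer, coefficient) pairs to evaluate under a fixed budget.

$$
h_\ell \leftarrow h_\ell + \alpha\, v_\ell, \qquad
\max_{S \subseteq \mathcal{L} \times \mathcal{A},\ |S| \le B}\ \ \max_{(\ell, \alpha) \in S} U(\ell, \alpha)
$$

Here $S$ is the set of pairs actually evaluated, $B$ caps the number of utility evaluations, and $U$ is the steering utility. Under this reading, a geometry prior does not change $U$; it only changes which pairs enter $S$ first.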

Core claim

Rank-1 steering is formalized as budget-constrained optimization over layer and coefficient. Prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations. Concept granularity measures directional heterogeneity across contrastive contexts and distinguishes concepts whose difference vectors share a stable global direction from those where the utility-maximizing direction rotates systematically across inputs. Higher granularity correlates with slower convergence and lower best-found performance. GRACE uses activation geometry to diagnose the dominant source of steering cost and,

What carries the argument

Prompt-boundary directional alignment, which scores candidate directions by their consistency with difference vectors computed at the boundary between contrastive prompt pairs and thereby guides layer and coefficient selection.
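
The idea of ranking layers by alignment can be sketched as follows. This is a minimal illustration, not the paper's implementation: the exact alignment score is not given on this page, so mean pairwise cosine similarity of per-pair difference vectors is used as a stand-in, and the names `layer_alignment` and `top_k_layers` and the toy data are hypothetical.

```python
import math
from itertools import combinations


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    if nu == 0.0 or nv == 0.0:
        return 0.0
    return dot / (nu * nv)


def layer_alignment(diffs):
    """Mean pairwise cosine similarity of one layer's difference vectors.

    ASSUMPTION: a stand-in for the paper's prompt-boundary alignment score.
    """
    pairs = list(combinations(diffs, 2))
    if not pairs:
        raise ValueError("need at least two difference vectors")
    return sum(cosine(u, v) for u, v in pairs) / len(pairs)


def top_k_layers(diffs_by_layer, k):
    """Return the k layer indices with the highest alignment score."""
    scores = {layer: layer_alignment(d) for layer, d in diffs_by_layer.items()}
    return sorted(scores, key=scores.get, reverse=True)[:k]


# Toy data (hypothetical): layer 0 has scattered differences,
# layer 1 has nearly parallel ones.
toy = {
    0: [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]],
    1: [[1.0, 0.1], [1.0, 0.0], [0.9, 0.1]],
}
print(top_k_layers(toy, k=1))
```

A search restricted to the returned layers spends its budget where the contrastive differences agree most, which is the behavior Figure 1 describes for top-k layer restriction.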

If this is right

Geometry-guided search recovers 95 percent of best-found utility after 39.8 percent fewer trials on average across three model families.
Higher concept granularity is associated with both slower convergence and lower best-found utility.
The GRACE framework diagnoses whether steering cost arises from search difficulty or from inherent directional heterogeneity and selects the appropriate remedy.
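
The trials-to-95-percent figure quoted above can be made concrete with a short sketch. The function names and the two utility traces below are hypothetical and chosen only to show the arithmetic; they are not results from the paper.

```python
def trials_to_fraction(utilities, fraction=0.95, reference=None):
    """1-indexed trial at which the running best first reaches fraction * reference.

    utilities: utility observed at each trial, in evaluation order.
    reference: best-found utility to compare against (defaults to max(utilities)).
    Returns None if the threshold is never reached. Assumes positive utilities.
    """
    if not utilities:
        raise ValueError("utilities must be non-empty")
    if reference is None:
        reference = max(utilities)
    threshold = fraction * reference
    best = float("-inf")
    for trial, u in enumerate(utilities, start=1):
        best = max(best, u)
        if best >= threshold:
            return trial
    return None


def relative_reduction(baseline_trials, guided_trials):
    """Fraction of trials saved by the guided search relative to the baseline."""
    return (baseline_trials - guided_trials) / baseline_trials


# Hypothetical utility traces for one concept (illustrative numbers only).
baseline = [0.10, 0.20, 0.15, 0.40, 0.55, 0.58, 0.80, 0.79, 0.82, 0.81]
guided = [0.30, 0.50, 0.78, 0.80, 0.79, 0.82, 0.81, 0.80, 0.79, 0.80]

# Use one shared reference so both searches are held to the same target.
ref = max(max(baseline), max(guided))
b = trials_to_fraction(baseline, reference=ref)
g = trials_to_fraction(guided, reference=ref)
print(b, g, relative_reduction(b, g))
```

Using a shared reference utility for both traces matters: if each search were scored against its own maximum, a search that plateaus early at a low utility would look artificially fast.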

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

For high-granularity concepts the geometry prior suggests that per-input direction adaptation or multi-direction methods would be more efficient than continued rank-1 search.
Pre-computing directional alignments on a small calibration set of prompts could further lower the cost of applying the method to new concepts.
The budgeted-search view could be tested on other lightweight control techniques such as low-rank updates or prompt-level interventions.

Load-bearing premise

A useful rank-1 intervention often exists for the studied concepts and the observed variability in steering effectiveness is primarily due to search difficulty rather than the absence of any single effective direction.

What would settle it

An experiment in which geometry-guided search yields no reduction in trials-to-95-percent compared with uniform random search, or in which low-granularity concepts still exhibit low best-found utility after exhaustive search, would falsify the central claims.
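
The first falsification test can be sketched as a small simulation. Everything here is a toy under stated assumptions: `toy_utility` is a hypothetical utility surface, `guided_layers` stands in for a top-k alignment ranking, and unmet targets are counted as `budget + 1` trials. It shows the shape of the comparison, not the paper's protocol.

```python
import random


def best_so_far(values):
    """Running maximum of a sequence of utilities."""
    out, best = [], float("-inf")
    for v in values:
        best = max(best, v)
        out.append(best)
    return out


def random_search(utility, layers, coefs, budget, rng):
    """Sample (layer, coef) uniformly at random; return the running-best curve."""
    trace = [utility(rng.choice(layers), rng.choice(coefs)) for _ in range(budget)]
    return best_so_far(trace)


def toy_utility(layer, coef):
    """Hypothetical utility surface peaking at layer 12, coefficient 2.0."""
    return max(0.0, 1.0 - 0.1 * abs(layer - 12) - 0.2 * abs(coef - 2.0))


layers = list(range(24))                # full search space
coefs = [0.5 * i for i in range(1, 9)]  # 0.5, 1.0, ..., 4.0
guided_layers = [11, 12, 13]            # ASSUMPTION: stand-in for a top-k ranking


def mean_trials_to_95(candidate_layers, seeds=200, budget=50):
    """Average trial index at which 95% of the peak utility is first reached."""
    target = 0.95 * toy_utility(12, 2.0)
    total = 0
    for seed in range(seeds):
        rng = random.Random(seed)
        curve = random_search(toy_utility, candidate_layers, coefs, budget, rng)
        # Runs that never hit the target are counted as budget + 1 trials.
        hit = next((t for t, v in enumerate(curve, start=1) if v >= target), budget + 1)
        total += hit
    return total / seeds


print(mean_trials_to_95(layers), mean_trials_to_95(guided_layers))
```

If the restricted search showed no advantage over the unrestricted one on real utility measurements, that would be the negative outcome described above. In this toy the advantage is built in, because the restricted layer set contains the peak by construction.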

Figures

Figures reproduced from arXiv: 2605.16362 by Haris Vikalo, Jianing Zhu, John T. Robertson, Zhangyang Wang.

**Figure 1.** Restricting search to the top-k layers ranked by prompt-boundary alignment accelerates convergence under a fixed budget while largely preserving final best-found utility.

**Figure 2.** Prompt-boundary alignment tracks where effective steering interventions emerge. Left: example …

**Figure 3.** Rank-1 steering is a search problem over layer and …

**Figure 4.** Higher concept granularity is associated with lower best-found steerability (…

**Figure 5.** Prompt-boundary alignment coincides with useful steering layers in Gemma3-27B-it.

**Figure 6.** Prompt-boundary alignment coincides with useful steering layers in Llama3.3-70B-Instruct.

**Figure 7.** Across the 20 concepts and three models studied, granularity remains a reliable estimator of final …

**Figure 8.** Across the 20 concepts and three models studied, granularity remains a reliable estimator of the …

**Figure 9.** Full search results in Llama3.3-70B-Instruct.

**Figure 10.** Full search results in Gemma3-27B-it.

**Figure 11.** Full search results in Gemma2-2B-it.

**Figure 12.** Tree-Parzen Estimation quickly outperforms fixed interval grid searching.

**Figure 13.** Per-(model, concept) steerability delta relative to the unconstrained PV baseline, with one …

**Figure 14.** The hallucinating clustering case. Left: per-prompt-pair similarity matrix on Gemma 2 2B; pairs 3 and 4 are misaligned with pairs 0–2 at most layers. Center: steerability vs. coefficient on Gemma 2 2B; PV coherence collapses above α ≈ 2.5 while cluster remains stable. Right: the same diagnostic on Gemma 3 27B, showing different outlier pairs (1 and 4) and a negative effect …

**Figure 15.** Left: Activation-type correlation predicts when prompt-last-only constrained search will underperform. Concepts below the 0.2 threshold (red dashed line) have systematically larger negative deltas. golden_gate_centric on Gemma 2 2B is highlighted. Right: The union strategy reduces failure cases from 8 to 4 and improves best-layer capture from 75% to 83%.

Abstract

Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce concept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson $r{=}0.44$ with trials-to-95%, $r{=}{-}0.46$ with best-found utility, both $p<0.001$). We present GRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from "when does rank-1 fail?" to "when is rank-1 cheap and stable?", turning activation geometry from a descriptive tool into an actionable prior for LLM control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper reframes steering variability as search cost and shows that geometry-guided search cuts trials-to-95% by about 40% (39.8% on average), while a new granularity measure flags the hard cases.

The main takeaway is that much of the hit-or-miss nature of rank-1 activation steering comes from the expense of finding the right layer and coefficient rather than the lack of any useful direction. The authors use directional alignment computed at prompt boundaries to prioritize the search and report reaching 95% of the best-found utility with 39.8% fewer trials on average across three model families. They also introduce concept granularity to measure how much the effective direction shifts across different contrastive contexts, and they link higher granularity to slower convergence and weaker final utility through Pearson correlations of moderate strength (r = 0.44 with trials-to-95%, r = -0.46 with best-found utility), both with p < 0.001. The GRACE framework then turns these geometric signals into a practical rule for choosing the remedy and spending the budget.

These pieces are the actual novelties relative to earlier steering work. The experiments give the claims some breadth by testing across model families and by quantifying both the search savings and the granularity correlations. The reframing from failure modes to budgeted cost feels like a useful shift for anyone who actually deploys these interventions.

The soft spots are mostly around the comparison setup. The headline reduction would be more convincing against a stronger baseline than plain random search, and it matters whether the alignment metric is computed on held-out data or re-uses the same contrastive pairs that define utility. Granularity is a fresh construct, so its exact definition and its sensitivity to how contexts are sampled need to be checked carefully in the methods.

This paper is for researchers working on activation steering, LLM editing, and lightweight control techniques. Readers who tune interventions by hand or want diagnostics for when rank-1 will be cheap will get concrete value from the geometry prior and the granularity measure. It deserves peer review because the claims are specific, the experiments span multiple models, and the gaps are fixable with clearer baselines and definitions rather than fundamental problems with the approach.

Referee Report

3 major / 2 minor

Summary. The paper argues that variability in rank-1 activation steering effectiveness for LLMs largely reflects search difficulty rather than the absence of useful directions. It formalizes steering as budget-constrained optimization over layer and coefficient, shows that prompt-boundary directional alignment predicts effective interventions, and reports that geometry-guided search reduces trials needed to recover 95% of best-found utility by 39.8% on average across three model families. It introduces concept granularity as a measure of directional heterogeneity across contrastive contexts, reports Pearson correlations (r=0.44 with trials-to-95%, r=-0.46 with best utility, p<0.001), and presents the GRACE framework for diagnosing and remedying steering difficulty via activation geometry.

Significance. If the central empirical claims hold under clarified controls, the work would usefully reframe activation steering as an optimization problem addressable by geometric priors, offering a practical method to reduce search cost while introducing a new diagnostic (granularity) that correlates with performance. The quantitative results on search reduction and the granularity correlations provide concrete, falsifiable contributions that could inform more efficient LLM control techniques.

major comments (3)

[Abstract and §4] Abstract and §4 (empirical evaluation): the reported 39.8% average reduction in trials-to-95% utility is load-bearing for the central claim that prompt-boundary alignment enables geometry-guided search superior to standard methods, yet the manuscript does not specify the exact baseline optimizer (uniform random sampling, grid search, or Bayesian optimization) nor whether the alignment metric is computed on held-out prompts versus the same contrastive pairs used to evaluate utility; this leaves open the possibility that the speedup arises from exploiting the same data rather than independent predictive power.
[§3] §3 (formalization and granularity definition): the claim that granularity distinguishes stable global directions from rotating local ones is central to explaining why some concepts remain expensive, but the exact formula for measuring directional heterogeneity across contrastive contexts is not provided, nor is it shown that this measure is independent of the utility evaluation procedure; without this, the reported Pearson correlations cannot be verified as supporting the interpretation.
[§5] §5 (GRACE framework): the assertion that GRACE uses activation geometry to allocate optimization effort efficiently depends on the predictive validity of the alignment metric and granularity; if the baseline comparison in §4 is not strengthened, the framework's practical advantage over naive search remains unestablished.

minor comments (2)

[§2] Notation for directional alignment and granularity should be introduced with explicit equations rather than descriptive text to improve reproducibility.
[Figures in §4] Figure captions for search curves should explicitly state the number of runs, random seeds, and exact utility metric used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their careful reading and constructive comments, which have helped clarify several important aspects of the work. We address each major comment point by point below and have revised the manuscript accordingly to improve methodological transparency and empirical rigor.

Referee: [Abstract and §4] Abstract and §4 (empirical evaluation): the reported 39.8% average reduction in trials-to-95% utility is load-bearing for the central claim that prompt-boundary alignment enables geometry-guided search superior to standard methods, yet the manuscript does not specify the exact baseline optimizer (uniform random sampling, grid search, or Bayesian optimization) nor whether the alignment metric is computed on held-out prompts versus the same contrastive pairs used to evaluate utility; this leaves open the possibility that the speedup arises from exploiting the same data rather than independent predictive power.

Authors: We agree that these implementation details require explicit clarification. The baseline is uniform random sampling over the joint space of intervention layers and coefficients, subject to the same evaluation budget. In the revised manuscript we state this explicitly in §4 and add a description of the experimental protocol. Regarding the alignment metric, we have added text confirming that it is computed on held-out contrastive prompt pairs that are disjoint from the pairs used for utility evaluation; this partitioning was already performed in the original experiments but not described. We have inserted a short paragraph and a footnote detailing the split to eliminate any ambiguity about data reuse. revision: yes
Referee: [§3] §3 (formalization and granularity definition): the claim that granularity distinguishes stable global directions from rotating local ones is central to explaining why some concepts remain expensive, but the exact formula for measuring directional heterogeneity across contrastive contexts is not provided, nor is it shown that this measure is independent of the utility evaluation procedure; without this, the reported Pearson correlations cannot be verified as supporting the interpretation.

Authors: We accept that the mathematical definition should be stated more formally. The revised §3 now includes the explicit formula: concept granularity is the standard deviation of the pairwise cosine similarities among the set of unit-normalized difference vectors obtained from multiple contrastive context pairs at the prompt boundary. We have also added a short paragraph and a supplementary note demonstrating that this geometric quantity is computed solely from activation differences and exhibits negligible correlation with downstream utility when evaluated on disjoint context sets, thereby establishing independence from the utility procedure. The Pearson correlations reported in the paper are unchanged and are now directly tied to this definition. revision: yes
Referee: [§5] §5 (GRACE framework): the assertion that GRACE uses activation geometry to allocate optimization effort efficiently depends on the predictive validity of the alignment metric and granularity; if the baseline comparison in §4 is not strengthened, the framework's practical advantage over naive search remains unestablished.

Authors: We agree that the practical utility of GRACE is contingent on the strengthened empirical comparisons. With the clarifications to the baseline (uniform random sampling) and the explicit independence of the granularity measure now provided, we have revised §5 to reference these updates and to include a concise description of how GRACE uses the alignment score and granularity diagnostic to decide between geometry-guided search and alternative remedies. The reported efficiency gains are thereby placed on firmer ground. revision: yes
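
The granularity definition given in the second response (standard deviation of pairwise cosine similarities among unit-normalized difference vectors) can be sketched directly. This follows the simulated rebuttal's wording only; the function names are hypothetical, and the choice of population rather than sample standard deviation is an assumption, since the text does not say which is used.

```python
import math
from itertools import combinations
from statistics import pstdev


def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    if norm == 0.0:
        raise ValueError("zero difference vector")
    return [x / norm for x in v]


def granularity(diffs):
    """Std of pairwise cosine similarities among unit-normalized difference vectors.

    ASSUMPTION: population standard deviation (pstdev); the source does not specify.
    """
    units = [unit(d) for d in diffs]
    sims = [sum(a * b for a, b in zip(u, v)) for u, v in combinations(units, 2)]
    if len(sims) < 2:
        raise ValueError("need at least three difference vectors")
    return pstdev(sims)


# Toy difference vectors (hypothetical): one shared direction vs. a rotating one.
stable = [[1.0, 0.0], [2.0, 0.0], [3.0, 0.0]]
rotating = [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]]
print(granularity(stable), granularity(rotating))
```

One property of this definition is worth noting: because it measures the spread of the similarities rather than their level, a set of mutually orthogonal difference vectors would also score zero. Whether the paper's measure behaves this way depends on its exact formula, which this page does not reproduce.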

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical measurements

The paper presents prompt-boundary directional alignment and concept granularity as measured quantities from contrastive activations, then reports observed correlations (Pearson r values) and search reductions from budgeted optimization experiments across model families. These are not shown to reduce by the paper's equations to quantities defined in terms of the same fitted parameters or prior self-citations. The 39.8% trial reduction and granularity associations are framed as experimental outcomes rather than tautological predictions. GRACE is described as a framework that allocates effort using these geometry measures, but the derivation chain does not collapse to self-definition or imported uniqueness theorems. The work is self-contained against external benchmarks of steering utility and does not rely on load-bearing self-citations for its central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that a useful rank-1 direction often exists for the concepts studied, and on the newly introduced granularity measure, whose computation details are not independently validated outside this work.

axioms (1)

domain assumption: A useful rank-1 intervention often exists for the concepts studied.
Explicitly stated in the abstract as the alternative explanation to prior views on steering variability.

invented entities (1)

concept granularity (no independent evidence)
purpose: Measure of directional heterogeneity across contrastive contexts, used to diagnose steering difficulty.
Newly defined in the paper to distinguish stable global directions from rotating ones.

pith-pipeline@v0.9.0 · 5852 in / 1335 out tokens · 55149 ms · 2026-05-22T09:55:30.967975+00:00 · methodology

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages · 4 internal anchors

[1]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022
[2]

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023
[3]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[4]

Steering Llama 2 via Contrastive Activation Addition , url =

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering Llama 2 via Contrastive Activation Addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...

work page doi:10.18653/v1/2024.acl-long.828 2024
[5]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[6]

Activation monitoring: advantages of using internal representations for llm oversight.2nd NeurIPS Works

Oam Patel and Rowan Wang. Activation monitoring: advantages of using internal representations for llm oversight.2nd NeurIPS Works. on Attributing Model Behavior at Scale, 2025

work page 2025
[7]

Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. Axbench: Steering llms? even simple baselines outperform sparse autoencoders.arXiv preprint arXiv:2501.17148, 2025

work page arXiv 2025
[8]

Understanding (un)reliability of steering vectors in language models, 2025

Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, and Dmitrii Krasheninnikov. Understanding (un)reliability of steering vectors in language models, 2025. URLhttps://arxiv.org/ abs/2505.22637

work page arXiv 2025
[9]

What can we actually steer? a multi-behavior study of activation control, 2026

Tetiana Bas and Krystian Novak. What can we actually steer? a multi-behavior study of activation control, 2026. URLhttps://arxiv.org/abs/2511.18284

work page arXiv 2026
[10]

Manning, and Christopher Potts

Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D. Manning, and Christopher Potts. Improved representation steering for language models, 2025. URLhttps://arxiv.org/abs/2505.20809

work page arXiv 2025
[11]

Improving instruction-following in language models through activation steering

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering. InThe Thirteenth International Conference on Learning Representations, 2025. URLhttps://openreview.net/forum?id=wozhdnRCtw

work page 2025
[12]

Stolfo, A., Balachandran, V ., Yousefi, S., Horvitz, E., and Nushi, B

Viacheslav Sinii, Alexey Gorbatovski, Artem Cherepanov, Boris Shaposhnikov, Nikita Balagansky, and Daniil Gavrilov. Steering LLM reasoning through bias-only adaptation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors,Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 920...

work page doi:10.18653/v1/2025.emnlp-main 2025
[13]

URLhttps://aclanthology.org/2025.emnlp-main.467/

work page 2025
[14]

Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering, 2025. URL https://arxiv.org/abs/2409.05907

work page arXiv 2025
[15]

Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, and Atticus Geiger. Hypersteer: Activation steering at scale with hypernetworks.arXiv preprint arXiv:2506.03292, 2025

work page arXiv 2025
[16]

Steering vector fields for context-aware inference-time control in large language models, 2026

Jiaqian Li, Yanshu Li, and Kuan-Hao Huang. Steering vector fields for context-aware inference-time control in large language models, 2026. URLhttps://arxiv.org/abs/2602.01654

work page arXiv 2026
[17]

A unified understanding and evaluation of steering methods.arXiv preprint arXiv:2502.02716,

Shawn Im and Sharon Li. A unified understanding and evaluation of steering methods, 2026. URL https://arxiv.org/abs/2502.02716

work page arXiv 2026
[18]

Towards monosemanticity: Decomposing language models with dictionary learning.Transformer Circuits Thread,

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

work page
[19]

https://transformer-circuits.pub/2023/monosemantic-features/index.html

work page 2023
[20]

Daniel Freeman, Theodore R

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

work page 2024
[21]

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization.Advances in Neural Information Processing Systems, 37:49519–49551, 2025

work page 2025
[22]

Steering language model refusal with sparse autoencoders

Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders, 2025. URLhttps://arxiv.org/abs/2411.11296

work page arXiv 2025
[23]

thoughts

Johnathan Sun and Andrew Zhang. Persona vectors in games: Measuring and steering strategies via activation vectors, 2026. URLhttps://arxiv.org/abs/2603.21398. 12 Appendix A Concept Definitions 14 B Full Methodological Details 15 B.1 Judge Protocol . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 15 B.2 Judge Prompts ...

work page arXiv 2026

[1] [1]

OPT: Open Pre-trained Transformer Language Models

Susan Zhang, Stephen Roller, Naman Goyal, Mikel Artetxe, Moya Chen, Shuohui Chen, Christopher Dewan, Mona Diab, Xian Li, Xi Victoria Lin, et al. Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068, 2022

work page internal anchor Pith review Pith/arXiv arXiv 2022

[2] [2]

GPT-4 Technical Report

OpenAI, Josh Achiam, Steven Adler, Sandhini Agarwal, Lama Ahmad, Ilge Akkaya, Florencia Leoni Ale- man, Diogo Almeida, Janko Altenschmidt, Sam Altman, Shyamal Anadkat, Red Avila, Igor Babuschkin, Suchir Balaji, Valerie Balcom, Paul Baltescu, Haiming Bao, Mohammad Bavarian, Jeff Belgum, Irwan Bello, Jake Berdine, Gabriel Bernadett-Shapiro, Christopher Bern...

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2303.08774 2023

[3] [3]

Steering Language Models With Activation Engineering

Alexander Matt Turner, Lisa Thiergart, Gavin Leech, David Udell, Juan J Vazquez, Ulisse Mini, and Monte MacDiarmid. Steering language models with activation engineering.arXiv preprint arXiv:2308.10248, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[4] [4]

Steering Llama 2 via Contrastive Activation Addition , url =

Nina Rimsky, Nick Gabrieli, Julian Schulz, Meg Tong, Evan Hubinger, and Alexander Turner. Steering Llama 2 via Contrastive Activation Addition. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar, editors,Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 15504–15522, Bangkok, Thailand, Augu...

work page doi:10.18653/v1/2024.acl-long.828 2024

[5] [5]

Persona Vectors: Monitoring and Controlling Character Traits in Language Models

Runjin Chen, Andy Arditi, Henry Sleight, Owain Evans, and Jack Lindsey. Persona vectors: Monitoring and controlling character traits in language models.arXiv preprint arXiv:2507.21509, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[6] [6]

Activation monitoring: advantages of using internal representations for llm oversight.2nd NeurIPS Works

Oam Patel and Rowan Wang. Activation monitoring: advantages of using internal representations for llm oversight.2nd NeurIPS Works. on Attributing Model Behavior at Scale, 2025

work page 2025

[7]

AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders

Zhengxuan Wu, Aryaman Arora, Atticus Geiger, Zheng Wang, Jing Huang, Dan Jurafsky, Christopher D Manning, and Christopher Potts. AxBench: Steering LLMs? Even simple baselines outperform sparse autoencoders. arXiv preprint arXiv:2501.17148, 2025

[8]

Understanding (un)reliability of steering vectors in language models

Joschka Braun, Carsten Eickhoff, David Krueger, Seyed Ali Bahrainian, and Dmitrii Krasheninnikov. Understanding (un)reliability of steering vectors in language models, 2025. URL https://arxiv.org/abs/2505.22637

[9]

What can we actually steer? A multi-behavior study of activation control

Tetiana Bas and Krystian Novak. What can we actually steer? A multi-behavior study of activation control, 2026. URL https://arxiv.org/abs/2511.18284

[10]

Improved representation steering for language models

Zhengxuan Wu, Qinan Yu, Aryaman Arora, Christopher D. Manning, and Christopher Potts. Improved representation steering for language models, 2025. URL https://arxiv.org/abs/2505.20809

[11]

Improving instruction-following in language models through activation steering

Alessandro Stolfo, Vidhisha Balachandran, Safoora Yousefi, Eric Horvitz, and Besmira Nushi. Improving instruction-following in language models through activation steering. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=wozhdnRCtw

[12]

Steering LLM reasoning through bias-only adaptation

Viacheslav Sinii, Alexey Gorbatovski, Artem Cherepanov, Boris Shaposhnikov, Nikita Balagansky, and Daniil Gavrilov. Steering LLM reasoning through bias-only adaptation. In Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng, editors, Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 920...

doi:10.18653/v1/2025.emnlp-main, 2025

[13]

URL https://aclanthology.org/2025.emnlp-main.467/

[14]

Programming refusal with conditional activation steering

Bruce W. Lee, Inkit Padhi, Karthikeyan Natesan Ramamurthy, Erik Miehling, Pierre Dognin, Manish Nagireddy, and Amit Dhurandhar. Programming refusal with conditional activation steering, 2025. URL https://arxiv.org/abs/2409.05907

[15]

Hypersteer: Activation steering at scale with hypernetworks

Jiuding Sun, Sidharth Baskaran, Zhengxuan Wu, Michael Sklar, Christopher Potts, and Atticus Geiger. Hypersteer: Activation steering at scale with hypernetworks. arXiv preprint arXiv:2506.03292, 2025

[16]

Steering vector fields for context-aware inference-time control in large language models

Jiaqian Li, Yanshu Li, and Kuan-Hao Huang. Steering vector fields for context-aware inference-time control in large language models, 2026. URL https://arxiv.org/abs/2602.01654

[17]

A unified understanding and evaluation of steering methods

Shawn Im and Sharon Li. A unified understanding and evaluation of steering methods, 2026. URL https://arxiv.org/abs/2502.02716

[18]

Towards monosemanticity: Decomposing language models with dictionary learning

Trenton Bricken, Adly Templeton, Joshua Batson, Brian Chen, Adam Jermyn, Tom Conerly, Nick Turner, Cem Anil, Carson Denison, Amanda Askell, Robert Lasenby, Yifan Wu, Shauna Kravec, Nicholas Schiefer, Tim Maxwell, Nicholas Joseph, Zac Hatfield-Dodds, Alex Tamkin, Karina Nguyen, Brayden McLean, Josiah E Burke, Tristan Hume, Shan Carter, Tom Henighan, and Ch...

[19]

https://transformer-circuits.pub/2023/monosemantic-features/index.html

[20]

Adly Templeton, Tom Conerly, Jonathan Marcus, Jack Lindsey, Trenton Bricken, Brian Chen, Adam Pearce, Craig Citro, Emmanuel Ameisen, Andy Jones, Hoagy Cunningham, Nicholas L Turner, Callum McDougall, Monte MacDiarmid, C. Daniel Freeman, Theodore R. Sumers, Edward Rees, Joshua Batson, Adam Jermyn, Shan Carter, Chris Olah, and Tom Henighan. Scaling monosema...

2024

[21]

Yuanpu Cao, Tianrong Zhang, Bochuan Cao, Ziyi Yin, Lu Lin, Fenglong Ma, and Jinghui Chen. Personalized steering of large language models: Versatile steering vectors through bi-directional preference optimization. Advances in Neural Information Processing Systems, 37:49519–49551, 2025

[22]

Steering language model refusal with sparse autoencoders

Kyle O’Brien, David Majercak, Xavier Fernandes, Richard Edgar, Blake Bullwinkel, Jingya Chen, Harsha Nori, Dean Carignan, Eric Horvitz, and Forough Poursabzi-Sangdeh. Steering language model refusal with sparse autoencoders, 2025. URL https://arxiv.org/abs/2411.11296

[23]

Persona vectors in games: Measuring and steering strategies via activation vectors

Johnathan Sun and Andrew Zhang. Persona vectors in games: Measuring and steering strategies via activation vectors, 2026. URL https://arxiv.org/abs/2603.21398