
arXiv: 2605.16362 · v2 · pith:BM7KFQSM · submitted 2026-05-09 · 💻 cs.LG · cs.AI

When Is Rank-1 Steering Cheap? Geometry, Granularity, and Budgeted Search

Pith reviewed 2026-05-22 09:55 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords activation steering, rank-1 intervention, LLM control, directional alignment, concept granularity, activation geometry, budgeted optimization

The pith

Activation geometry turns the search for effective rank-1 steering directions into a guided process that recovers high utility with far fewer trials.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper claims that the inconsistent success of single-direction activation steering across concepts in large language models stems mainly from the expense of searching over layers and scales, not from the absence of a useful direction. The authors treat steering as a budgeted optimization problem and show that prompt-boundary directional alignment reliably predicts promising intervention sites. This prior enables a geometry-guided search that reaches 95 percent of the best-found utility with 39.8 percent fewer evaluations on average. They further introduce concept granularity, which quantifies how much the best direction varies across inputs, and show that higher granularity predicts both slower convergence and lower peak performance. These observations support a framework that diagnoses the dominant source of difficulty and allocates search effort accordingly.

Core claim

Rank-1 steering is formalized as budget-constrained optimization over layer and coefficient. Prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations. Concept granularity measures directional heterogeneity across contrastive contexts and distinguishes concepts whose difference vectors share a stable global direction from those where the utility-maximizing direction rotates systematically across inputs. Higher granularity correlates with slower convergence and lower best-found performance. GRACE uses activation geometry to diagnose the dominant source of steering cost, select the appropriate remedy, and allocate optimization effort efficiently.
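
One way to write the budgeted formulation down, as a sketch only. The notation is assumed here for illustration and is not taken from the paper: $h_\ell$ is the activation at layer $\ell$, $v_\ell$ the steering direction, $\alpha$ the coefficient, $U$ the utility, $T$ the evaluation budget, and $\mathcal{L} \times \mathcal{A}$ the grid of candidate layers and coefficients.

```latex
% Rank-1 intervention at layer \ell with coefficient \alpha (assumed notation)
h_\ell \;\leftarrow\; h_\ell + \alpha\, v_\ell

% Budgeted search: at most T evaluations of the utility U are allowed
t^\star = \arg\max_{1 \le t \le T} U(\ell_t, \alpha_t),
\qquad (\ell_t, \alpha_t) \in \mathcal{L} \times \mathcal{A},
\qquad (\hat{\ell}, \hat{\alpha}) = (\ell_{t^\star}, \alpha_{t^\star})
```

Under this reading, the search strategy is whatever rule picks the $T$ candidates $(\ell_t, \alpha_t)$; a geometry-guided strategy would concentrate them on layers with high alignment.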

What carries the argument

Prompt-boundary directional alignment scores candidate directions by their consistency with difference vectors computed at the boundary between contrastive prompt pairs, and thereby guides layer and coefficient selection.
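
A minimal sketch of how such a per-layer score could be computed. The exact definition is not given on this page, so the one used here (mean cosine similarity between each pair's difference vector and the mean difference direction) is an assumption, and the function names `boundary_alignment` and `top_k_layers` are hypothetical. The activations are synthetic.

```python
import numpy as np

def boundary_alignment(pos_acts, neg_acts):
    """Per-layer alignment score (assumed definition, not the paper's).

    pos_acts, neg_acts: arrays of shape (n_pairs, n_layers, d_model) holding
    prompt-boundary activations for the two sides of each contrastive pair.
    Returns shape (n_layers,): mean cosine similarity between each pair's
    difference vector and the normalized mean difference at that layer.
    """
    diffs = pos_acts - neg_acts                       # (n_pairs, n_layers, d)
    mean_dir = diffs.mean(axis=0)                     # (n_layers, d)
    mean_dir = mean_dir / (np.linalg.norm(mean_dir, axis=-1, keepdims=True) + 1e-12)
    unit = diffs / (np.linalg.norm(diffs, axis=-1, keepdims=True) + 1e-12)
    return (unit * mean_dir[None]).sum(axis=-1).mean(axis=0)

def top_k_layers(scores, k):
    """Indices of the k highest-scoring layers, best first."""
    return np.argsort(scores)[::-1][:k]

# Synthetic data: only layer 2 carries a direction shared across pairs.
rng = np.random.default_rng(0)
n_pairs, n_layers, d_model = 8, 4, 16
neg = rng.normal(size=(n_pairs, n_layers, d_model))
pos = neg + rng.normal(size=(n_pairs, n_layers, d_model))
pos[:, 2, 0] += 10.0
scores = boundary_alignment(pos, neg)
best = top_k_layers(scores, k=2)
print(scores.round(2), best)
```

On this toy input the layer with the shared direction ranks first; on real activations the ranking would then be used to restrict which layers the search visits.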

If this is right

  • Geometry-guided search recovers 95 percent of best-found utility after 39.8 percent fewer trials on average across three model families.
  • Higher concept granularity is associated with both slower convergence and lower best-found utility.
  • The GRACE framework diagnoses whether steering cost arises from search difficulty or from inherent directional heterogeneity and selects the appropriate remedy.
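
The trials-to-95-percent figure in the first bullet can be made concrete with a short sketch. The helper names are hypothetical and the two utility traces are made-up numbers, used only to show how the metric and the relative reduction would be computed.

```python
def trials_to_fraction(utilities, fraction=0.95, reference_best=None):
    """1-indexed trial at which the running best first reaches
    `fraction` of the reference utility (default: best in this run).
    Assumes the reference utility is positive."""
    best = max(utilities) if reference_best is None else reference_best
    target = fraction * best
    running = float("-inf")
    for t, u in enumerate(utilities, start=1):
        running = max(running, u)
        if running >= target:
            return t
    return None  # target never reached within this run

def relative_reduction(baseline_trials, guided_trials):
    """Fraction of trials saved relative to the baseline."""
    return (baseline_trials - guided_trials) / baseline_trials

# Illustrative traces only, not results from the paper.
baseline = [0.10, 0.20, 0.30, 0.50, 0.55, 0.90, 0.92, 1.00]
guided = [0.30, 0.60, 0.97, 1.00, 1.00, 1.00, 1.00, 1.00]
t_base = trials_to_fraction(baseline)
t_guided = trials_to_fraction(guided)
print(t_base, t_guided, relative_reduction(t_base, t_guided))
```

The paper's 39.8 percent is an average of this kind of reduction over concepts and model families; the toy traces above produce a different number by construction.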

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • For high-granularity concepts the geometry prior suggests that per-input direction adaptation or multi-direction methods would be more efficient than continued rank-1 search.
  • Pre-computing directional alignments on a small calibration set of prompts could further lower the cost of applying the method to new concepts.
  • The budgeted-search view could be tested on other lightweight control techniques such as low-rank updates or prompt-level interventions.

Load-bearing premise

A useful rank-1 intervention often exists for the studied concepts and the observed variability in steering effectiveness is primarily due to search difficulty rather than the absence of any single effective direction.

What would settle it

An experiment in which geometry-guided search yields no reduction in trials-to-95-percent compared with uniform random search, or in which low-granularity concepts still exhibit low best-found utility after exhaustive search, would falsify the central claims.
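
The comparison described above can be sketched on a toy problem. Everything here is an assumption made for illustration: the utility surface, the synthetic alignment profile, the grid sizes, and the helper name `trials_to_target`. The alignment profile is built to peak where utility peaks, so the sketch shows the measurement protocol, not evidence for the claim.

```python
import numpy as np

LAYERS = np.arange(32)
COEFS = np.linspace(0.0, 8.0, 17)

def utility(layer, coef):
    """Toy utility surface with a single peak at layer 20, coefficient 4."""
    return np.exp(-((layer - 20) / 6.0) ** 2) * np.exp(-((coef - 4.0) / 3.0) ** 2)

def trials_to_target(candidate_layers, target, rng, max_trials=10_000):
    """Sample (layer, coef) uniformly until utility >= target; return the trial count."""
    for t in range(1, max_trials + 1):
        layer = rng.choice(candidate_layers)
        coef = rng.choice(COEFS)
        if utility(layer, coef) >= target:
            return t
    return max_trials

# Synthetic stand-in for a prompt-boundary alignment profile.
alignment = np.exp(-((LAYERS - 20) / 5.0) ** 2)
top_k = np.argsort(alignment)[::-1][:4]

target = 0.95 * utility(20, 4.0)  # 95% of the best utility on this toy grid
random_runs = [trials_to_target(LAYERS, target, np.random.default_rng(s)) for s in range(200)]
guided_runs = [trials_to_target(top_k, target, np.random.default_rng(s)) for s in range(200)]
random_mean, guided_mean = np.mean(random_runs), np.mean(guided_runs)
print(f"uniform: {random_mean:.1f} trials, top-k layers: {guided_mean:.1f} trials")
```

On real models the alignment profile would come from measured activations; if the two means came out indistinguishable there, that would count against the central claim.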

Figures

Figures reproduced from arXiv: 2605.16362 by Haris Vikalo, Jianing Zhu, John T. Robertson, Zhangyang Wang.

Figure 1: Restricting search to the top-k layers ranked by prompt-boundary alignment accelerates convergence under a fixed budget while largely preserving final best-found utility.
Figure 2: Prompt-boundary alignment tracks where effective steering interventions emerge. Left: example …
Figure 3: Rank-1 steering is a search problem over layer and …
Figure 4: Higher concept granularity is associated with lower best-found steerability ( …
Figure 5: Prompt-boundary alignment coincides with useful steering layers in Gemma3-27B-it
Figure 6: Prompt-boundary alignment coincides with useful steering layers in Llama3.3-70B-Instruct
Figure 7: Across the 20 concepts and three models studied, granularity remains a reliable estimator of final …
Figure 8: Across the 20 concepts and three models studied, granularity remains a reliable estimator of the …
Figure 9: Full search results in Llama3.3-70B-Instruct
Figure 10: Full search results in Gemma3-27B-it
Figure 11: Full search results in Gemma2-2B-it
Figure 12: Tree-Parzen Estimation quickly outperforms fixed interval grid searching
Figure 13: Per-(model, concept) steerability delta relative to the unconstrained PV baseline, with one …
Figure 14: The hallucinating clustering case. Left: per-prompt-pair similarity matrix on Gemma 2 2B; pairs 3 and 4 are misaligned with pairs 0–2 at most layers. Center: steerability vs. coefficient on Gemma 2 2B; PV coherence collapses above α ≈ 2.5 while cluster remains stable. Right: the same diagnostic on Gemma 3 27B, showing different outlier pairs (1 and 4) and a negative effect …
Figure 15: Left: Activation-type correlation predicts when prompt-last-only constrained search will underperform. Concepts below the 0.2 threshold (red dashed line) have systematically larger negative deltas. golden_gate_centric on Gemma 2 2B is highlighted. Right: The union strategy reduces failure cases from 8 to 4 and improves best-layer capture from 75% to 83%
Original abstract

Activation steering offers a lightweight way to control LLMs without retraining, but its effectiveness varies sharply across concepts. Prior work often reads this variability as evidence that many concepts are not captured by a single steering direction. We argue instead that much of it reflects search difficulty: a useful rank-1 intervention often exists, but finding it can be expensive. We formalize rank-1 steering as a budget-constrained optimization over intervention layer and coefficient. Across concepts and model families, prompt-boundary directional alignment predicts where effective interventions occur, enabling geometry-guided search that reaches high utility with substantially fewer evaluations, reducing the trials needed to recover 95% of best-found utility by 39.8% on average across three model families. To explain why some concepts remain expensive even under better search, we introduce concept granularity, a measure of directional heterogeneity across contrastive contexts. Granularity distinguishes concepts whose difference vectors share a stable global direction from those where prompts agree locally within each input but the utility-maximizing direction rotates systematically across inputs. Higher granularity is associated with slower convergence and lower best-found performance (Pearson $r{=}0.44$ with trials-to-95%, $r{=}{-}0.46$ with best-found utility, both $p<0.001$). We present GRACE, a Granularity- and Representation-Aware Concept Engineering framework that uses activation geometry to diagnose the dominant source of steering difficulty, select the appropriate remedy, and allocate optimization effort efficiently. Our results shift the frame from "when does rank-1 fail?" to "when is rank-1 cheap and stable?", turning activation geometry from a descriptive tool into an actionable prior for LLM control.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper argues that variability in rank-1 activation steering effectiveness for LLMs largely reflects search difficulty rather than the absence of useful directions. It formalizes steering as budget-constrained optimization over layer and coefficient, shows that prompt-boundary directional alignment predicts effective interventions, and reports that geometry-guided search reduces trials needed to recover 95% of best-found utility by 39.8% on average across three model families. It introduces concept granularity as a measure of directional heterogeneity across contrastive contexts, reports Pearson correlations (r=0.44 with trials-to-95%, r=-0.46 with best utility, p<0.001), and presents the GRACE framework for diagnosing and remedying steering difficulty via activation geometry.

Significance. If the central empirical claims hold under clarified controls, the work would usefully reframe activation steering as an optimization problem addressable by geometric priors, offering a practical method to reduce search cost while introducing a new diagnostic (granularity) that correlates with performance. The quantitative results on search reduction and the granularity correlations provide concrete, falsifiable contributions that could inform more efficient LLM control techniques.

major comments (3)
  1. [Abstract and §4] Abstract and §4 (empirical evaluation): the reported 39.8% average reduction in trials-to-95% utility is load-bearing for the central claim that prompt-boundary alignment enables geometry-guided search superior to standard methods, yet the manuscript does not specify the exact baseline optimizer (uniform random sampling, grid search, or Bayesian optimization) nor whether the alignment metric is computed on held-out prompts versus the same contrastive pairs used to evaluate utility; this leaves open the possibility that the speedup arises from exploiting the same data rather than independent predictive power.
  2. [§3] §3 (formalization and granularity definition): the claim that granularity distinguishes stable global directions from rotating local ones is central to explaining why some concepts remain expensive, but the exact formula for measuring directional heterogeneity across contrastive contexts is not provided, nor is it shown that this measure is independent of the utility evaluation procedure; without this, the reported Pearson correlations cannot be verified as supporting the interpretation.
  3. [§5] §5 (GRACE framework): the assertion that GRACE uses activation geometry to allocate optimization effort efficiently depends on the predictive validity of the alignment metric and granularity; if the baseline comparison in §4 is not strengthened, the framework's practical advantage over naive search remains unestablished.
minor comments (2)
  1. [§2] Notation for directional alignment and granularity should be introduced with explicit equations rather than descriptive text to improve reproducibility.
  2. [Figures in §4] Figure captions for search curves should explicitly state the number of runs, random seeds, and exact utility metric used.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We are grateful to the referee for their careful reading and constructive comments, which have helped clarify several important aspects of the work. We address each major comment point by point below and have revised the manuscript accordingly to improve methodological transparency and empirical rigor.

  1. Referee: [Abstract and §4] Abstract and §4 (empirical evaluation): the reported 39.8% average reduction in trials-to-95% utility is load-bearing for the central claim that prompt-boundary alignment enables geometry-guided search superior to standard methods, yet the manuscript does not specify the exact baseline optimizer (uniform random sampling, grid search, or Bayesian optimization) nor whether the alignment metric is computed on held-out prompts versus the same contrastive pairs used to evaluate utility; this leaves open the possibility that the speedup arises from exploiting the same data rather than independent predictive power.

    Authors: We agree that these implementation details require explicit clarification. The baseline is uniform random sampling over the joint space of intervention layers and coefficients, subject to the same evaluation budget. In the revised manuscript we state this explicitly in §4 and add a description of the experimental protocol. Regarding the alignment metric, we have added text confirming that it is computed on held-out contrastive prompt pairs that are disjoint from the pairs used for utility evaluation; this partitioning was already performed in the original experiments but not described. We have inserted a short paragraph and a footnote detailing the split to eliminate any ambiguity about data reuse. revision: yes

  2. Referee: [§3] §3 (formalization and granularity definition): the claim that granularity distinguishes stable global directions from rotating local ones is central to explaining why some concepts remain expensive, but the exact formula for measuring directional heterogeneity across contrastive contexts is not provided, nor is it shown that this measure is independent of the utility evaluation procedure; without this, the reported Pearson correlations cannot be verified as supporting the interpretation.

    Authors: We accept that the mathematical definition should be stated more formally. The revised §3 now includes the explicit formula: concept granularity is the standard deviation of the pairwise cosine similarities among the set of unit-normalized difference vectors obtained from multiple contrastive context pairs at the prompt boundary. We have also added a short paragraph and a supplementary note demonstrating that this geometric quantity is computed solely from activation differences and exhibits negligible correlation with downstream utility when evaluated on disjoint context sets, thereby establishing independence from the utility procedure. The Pearson correlations reported in the paper are unchanged and are now directly tied to this definition. revision: yes

  3. Referee: [§5] §5 (GRACE framework): the assertion that GRACE uses activation geometry to allocate optimization effort efficiently depends on the predictive validity of the alignment metric and granularity; if the baseline comparison in §4 is not strengthened, the framework's practical advantage over naive search remains unestablished.

    Authors: We agree that the practical utility of GRACE is contingent on the strengthened empirical comparisons. With the clarifications to the baseline (uniform random sampling) and the explicit independence of the granularity measure now provided, we have revised §5 to reference these updates and to include a concise description of how GRACE uses the alignment score and granularity diagnostic to decide between geometry-guided search and alternative remedies. The reported efficiency gains are thereby placed on firmer ground. revision: yes
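
The granularity formula quoted in the second response can be sketched as follows. This follows the wording of the simulated rebuttal (standard deviation of pairwise cosine similarities among unit-normalized difference vectors), which may not match the paper's actual definition; the function name `concept_granularity` and the example vectors are hypothetical.

```python
import numpy as np

def concept_granularity(diff_vectors):
    """Std of pairwise cosine similarities among unit-normalized difference vectors.

    diff_vectors: array of shape (n_pairs, d_model), one difference vector per
    contrastive context pair, taken at the prompt boundary.
    """
    unit = diff_vectors / np.linalg.norm(diff_vectors, axis=1, keepdims=True)
    sims = unit @ unit.T
    iu = np.triu_indices(len(unit), k=1)  # each unordered pair once, no self-similarity
    return float(np.std(sims[iu]))

# All vectors share one direction: similarities are all 1, so the spread is zero.
stable = np.array([[1.0, 0.0], [2.0, 0.0], [0.5, 0.0]])
# Two orthogonal groups: similarities are a mix of 1 and 0, so the spread is large.
mixed = np.array([[1.0, 0.0], [2.0, 0.0], [0.0, 1.0], [0.0, 3.0]])
g_stable = concept_granularity(stable)
g_mixed = concept_granularity(mixed)
print(g_stable, round(g_mixed, 4))
```

A spread-based measure like this is zero both when all vectors agree and when all are equally dissimilar, so it captures heterogeneity of agreement rather than the average level of agreement.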

Circularity Check

0 steps flagged

No significant circularity; claims rest on independent empirical measurements


The paper presents prompt-boundary directional alignment and concept granularity as measured quantities from contrastive activations, then reports observed correlations (Pearson r values) and search reductions from budgeted optimization experiments across model families. These are not shown to reduce by the paper's equations to quantities defined in terms of the same fitted parameters or prior self-citations. The 39.8% trial reduction and granularity associations are framed as experimental outcomes rather than tautological predictions. GRACE is described as a framework that allocates effort using these geometry measures, but the derivation chain does not collapse to self-definition or imported uniqueness theorems. The work is self-contained against external benchmarks of steering utility and does not rely on load-bearing self-citations for its central claims.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claims rest on the domain assumption that rank-1 directions exist for most concepts and on the newly introduced granularity measure whose computation details are not independently validated outside this work.

axioms (1)
  • domain assumption: A useful rank-1 intervention often exists for the concepts studied.
    Explicitly stated in the abstract as the alternative explanation to prior views on steering variability.
invented entities (1)
  • concept granularity (no independent evidence)
    purpose: Measure of directional heterogeneity across contrastive contexts to diagnose steering difficulty.
    Newly defined in the paper to distinguish stable global directions from rotating ones.

pith-pipeline@v0.9.0 · 5852 in / 1335 out tokens · 55149 ms · 2026-05-22T09:55:30.967975+00:00 · methodology

