Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
2 Pith papers cite this work. Polarity classification is still indexing.
citation-role summary: background 1
citation-polarity summary
verdicts: UNVERDICTED 2
roles: background 1
polarities: background 1
representative citing papers
Inflectional features stay linearly decodable across all layers while lexical identity weakens with depth in modern transformers.
citing papers explorer
- What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
  Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
- Model Internal Sleuthing: Finding Lexical Identity and Inflectional Features in Modern Language Models
  Inflectional features stay linearly decodable across all layers while lexical identity weakens with depth in modern transformers.