Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
Mavor-Parker, Aengus Lynch, Stefan Heimersheim, and Adri \`a Garriga-Alonso
1 Pith paper cite this work. Polarity classification is still indexing.
1
Pith paper citing it
fields
cs.LG 1years
2026 1verdicts
UNVERDICTED 1representative citing papers
citing papers explorer
-
What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.