Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
2 Pith papers cite this work. Polarity classification is still indexing.
verdicts
UNVERDICTED 2
representative citing papers
SFTMix applies mixup regularization to confidence-stratified interpolated examples during LLM instruction tuning to achieve consistent gains across models and datasets.
citing papers explorer
- What Drives Representation Steering? A Mechanistic Case Study on Steering Refusal
  Steering vectors for refusal primarily modify the OV circuit in attention, ignore most of the QK circuit, and can be sparsified to 1-10% of dimensions while retaining performance.
- SFTMix: Elevating Language Model Instruction Tuning with Mixup Recipe
  SFTMix applies mixup regularization to confidence-stratified interpolated examples during LLM instruction tuning to achieve consistent gains across models and datasets.