Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics
3 Pith papers cite this work. Polarity classification is still indexing.
abstract
Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce SPLIT, a new steering approach guided by this analysis that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.
years: 2026 (3)
verdicts: UNVERDICTED (3)
citing papers explorer
- Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection
  Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.
- CreativityNeuro: Steering Language Model Weights to Improve Divergent Thinking and Reduce Mode Collapse
  CreativityNeuro applies contrastive weight steering to LLMs, yielding up to 14 percentile gains on the Divergent Association Task and improved originality in human-rated tests while reducing mode collapse.
- How LoRA Remembers? A Parametric Memory Law for LLM Finetuning
  Introduces Parametric Memory Law as power law for LoRA memory capacity and MemFT threshold-guided optimization for better memory fidelity.