Why Steering Works: Toward a Unified View of Language Model Parameter Dynamics

2026 · cs.CL · arXiv 2602.02343

3 Pith papers cite this work. Polarity classification is still indexing.

abstract

Methods for controlling large language models (LLMs), including local weight fine-tuning, LoRA-based adaptation, and activation-based interventions, are often studied in isolation, obscuring their connections and making comparison difficult. In this work, we present a unified view that frames these interventions as dynamic weight updates induced by a control signal, placing them within a single conceptual framework. Building on this view, we propose a unified preference-utility analysis that separates control effects into preference, defined as the tendency toward a target concept, and utility, defined as coherent and task-valid generation, and measures both on a shared log-odds scale using polarity-paired contrastive examples. Across methods, we observe a consistent trade-off between preference and utility: stronger control increases preference while predictably reducing utility. We further explain this behavior through an activation manifold perspective, in which control shifts representations along target-concept directions to enhance preference, while utility declines primarily when interventions push representations off the model's valid-generation manifold. Finally, we introduce a new steering approach, SPLIT, guided by this analysis, that improves preference while better preserving utility. Code is available at https://github.com/zjunlp/EasyEdit/blob/main/examples/SPLIT.md.

representative citing papers

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

cs.LG · 2026-05-27 · unverdicted · novelty 6.0

Activation steering produces synthetic safety-violating data that improves downstream classifiers over prompting on most tested concepts when a harmonic mean of alignment, coherence, and diversity is optimized.

CreativityNeuro: Steering Language Model Weights to Improve Divergent Thinking and Reduce Mode Collapse

cs.AI · 2026-07-01 · unverdicted · novelty 5.0

CreativityNeuro applies contrastive weight steering to LLMs, yielding up to 14 percentile gains on the Divergent Association Task and improved originality in human-rated tests while reducing mode collapse.

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

cs.CL · 2026-05-28 · unverdicted · novelty 5.0

Introduces the Parametric Memory Law, a power law for LoRA memory capacity, and MemFT, a threshold-guided optimization method for better memory fidelity.

citing papers explorer

Activation Steering for Synthetic Data Generation: The Role of Diversity in Downstream Safety Detection

cs.LG · 2026-05-27 · unverdicted · none · ref 56 · internal anchor

CreativityNeuro: Steering Language Model Weights to Improve Divergent Thinking and Reduce Mode Collapse

cs.AI · 2026-07-01 · unverdicted · none · ref 30 · internal anchor

How LoRA Remembers? A Parametric Memory Law for LLM Finetuning

cs.CL · 2026-05-28 · unverdicted · none · ref 40 · internal anchor
