Steering Language Models Before They Speak: Logit-Level Interventions

Hyeseon An; Hyundong Jin; Shinwoo Park; Yo-Sub Han

arxiv: 2601.10960 · v2 · pith:CFYXNK5Cnew · submitted 2026-01-16 · 💻 cs.CL · cs.AI

classification: cs.CL, cs.AI

keywords: steering, logit, model, models, SWAI, auxiliary, control, internal

Controllable generation requires language models to realize output characteristics such as reading level, politeness, and toxicity. Existing steering methods are often indirect, require access to internal activations, or depend on auxiliary trained models. We propose SWAI, a training-free inference-time method that addresses these limitations by steering directly in logit space using corpus-derived token statistics. SWAI computes z-normalized one-vs-rest log-odds scores from labeled corpora and biases high-scoring tokens only within the model's top-K candidate set, allowing control to favor target-characteristic tokens while preserving contextually plausible choices. Across readability, politeness, and toxicity control, SWAI consistently improves over prompt-based and prior logit-level baselines without modifying model parameters, accessing internal layers, or training an auxiliary model. Selectivity and lookup-table ablations show that the gains come from target-specific statistical scores rather than generic logit perturbation. These results indicate that effective steering does not require learned controllers when the logit intervention is guided by target-specific statistics under high-probability candidates.
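The two steps the abstract describes (scoring tokens from labeled corpora, then biasing only within the top-K candidates) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not give the exact log-odds formula, smoothing, score threshold, or bias scale, so the add-one smoothing, the `threshold` and `strength` parameters, and the function names `log_odds_z_scores` and `steer_logits` are all assumptions made for illustration. Tokens are plain strings and logits a plain dictionary, standing in for a real tokenizer and model.

```python
import math
from collections import Counter


def log_odds_z_scores(corpora, target, smoothing=1.0):
    """Z-normalized one-vs-rest log-odds score per token (hypothetical helper).

    corpora: dict mapping label -> list of tokenized documents.
    Assumes at least two distinct tokens overall, so smoothed
    probabilities stay strictly below 1.
    """
    target_counts = Counter(tok for doc in corpora[target] for tok in doc)
    rest_counts = Counter(
        tok
        for label, docs in corpora.items()
        if label != target
        for doc in docs
        for tok in doc
    )
    vocab = set(target_counts) | set(rest_counts)
    n_target = sum(target_counts.values())
    n_rest = sum(rest_counts.values())
    v = len(vocab)

    raw = {}
    for tok in vocab:
        # Add-one smoothing is an assumption; the abstract does not specify it.
        p_t = (target_counts[tok] + smoothing) / (n_target + smoothing * v)
        p_r = (rest_counts[tok] + smoothing) / (n_rest + smoothing * v)
        raw[tok] = math.log(p_t / (1 - p_t)) - math.log(p_r / (1 - p_r))

    # Z-normalize the raw log-odds over the vocabulary.
    mean = sum(raw.values()) / v
    std = math.sqrt(sum((x - mean) ** 2 for x in raw.values()) / v) or 1.0
    return {tok: (x - mean) / std for tok, x in raw.items()}


def steer_logits(logits, scores, top_k=3, strength=2.0, threshold=0.0):
    """Boost high-scoring tokens, but only inside the top-K candidates.

    logits: dict mapping token -> next-token logit.
    strength and threshold are illustrative knobs, not values from the paper.
    """
    top = sorted(logits, key=logits.get, reverse=True)[:top_k]
    steered = dict(logits)
    for tok in top:
        z = scores.get(tok, 0.0)
        if z > threshold:
            steered[tok] += strength * z
    return steered


corpora = {
    "polite": [["please", "thank", "you"], ["please", "kindly"]],
    "rude": [["shut", "up", "you"], ["go", "away"]],
}
scores = log_odds_z_scores(corpora, target="polite")
logits = {"please": 1.0, "shut": 1.5, "go": 1.2, "kindly": -5.0}
steered = steer_logits(logits, scores, top_k=3)
```

In this toy setup, "kindly" has a high politeness score but sits outside the top-K set, so its logit is left untouched. That restriction is what the abstract credits with preserving contextually plausible choices.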

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL 2026-05 unverdicted novelty 6.0

DISCA uses disagreement among WVS-grounded persona panels to apply loss-averse logit corrections that reduce cultural misalignment by 10-24% on MultiTP for models 3.8B and larger, without weight changes.

Training-Free Cultural Alignment of Large Language Models via Persona Disagreement
cs.CL 2026-05 conditional novelty 6.0

DISCA converts within-country disagreement among World Values Survey personas into a bounded logit correction that reduces cultural misalignment by 10-24% on MultiTP for models 3.8B and larger across 20 countries, wit...