Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
A language model's guide through latent space
3 Pith papers cite this work. Polarity classification is still indexing.
fields
cs.LG 3representative citing papers
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.
citing papers explorer
-
Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
-
Refusal in Language Models Is Mediated by a Single Direction
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
-
Can Aha Moments Be Fake? Identifying True and Decorative Thinking Steps in Chain-of-Thought
LLMs interleave true causal reasoning steps with decorative ones in CoT, with only ~2.3% of steps having high causal impact on AIME for Qwen-2.5, and a steering direction can force internal use of specific steps.