A Language Model’s Guide Through Latent Space
5 Pith papers cite this work. Polarity classification is still indexing.
representative citing papers
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
UniSteer trains a conditional flow matching model on LLM residual-stream activations to enable text-conditioned steering and classification across multiple behavioral tasks.
Proposes a fault-tolerance architecture for AI safety by analogizing unreliable AI artifacts to Byzantine nodes and applying consensus mechanisms.
citing papers explorer
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.