Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
A Language Model’s Guide Through Latent Space
5 Pith papers cite this work.
representative citing papers
Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
UniSteer trains a conditional flow matching model on LLM residual-stream activations to enable text-conditioned steering and classification across multiple behavioral tasks.
Proposes a fault-tolerance architecture for AI safety by analogizing unreliable AI artifacts to Byzantine nodes and applying consensus mechanisms.
citing papers explorer
- A Byzantine Fault Tolerance Approach towards AI Safety