A Language Model’s Guide Through Latent Space
5 Pith papers cite this work.
Citing papers
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
  Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- Refusal in Language Models Is Mediated by a Single Direction
  Refusal in language models is mediated by a single direction in residual stream activations that can be erased to disable safety or added to elicit refusal.
- UniSteer: Text-Guided Flow Matching in Activation Space for Versatile LLM Steering
  UniSteer trains a conditional flow matching model on LLM residual-stream activations to enable text-conditioned steering and classification across multiple behavioral tasks.
- A Byzantine Fault Tolerance Approach towards AI Safety
  Proposes a fault-tolerance architecture for AI safety by analogizing unreliable AI artifacts to Byzantine nodes and applying consensus mechanisms.
- Can Aha Moments Be Fake? Towards Quantifying Decorative and True Thinking in Chain-of-Thought