Mechanistic Interpretability for
5 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (5)
roles: background (1)
polarities: background (1)
citing papers explorer
- Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control
  Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
- Interpretability Can Be Actionable
  Interpretability research should be judged by actionability, the degree to which its insights support concrete decisions and interventions, rather than explanatory power alone.
- Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers
  Training a mean-field Transformer under L2 regularization induces an escape from attention-driven token clustering in later layers after initial clustering.
- Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models
  Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
- How Language Models Process Negation