Mechanistic Interpretability for

Leonard Bereska, Stratis Gavves · 2024

5 Pith papers cite this work. Polarity classification is still indexing.

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control

cs.LG · 2026-04-21 · conditional · novelty 7.0

Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.

Interpretability Can Be Actionable

cs.LG · 2026-05-11 · conditional · novelty 6.0

Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.

Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers

cs.LG · 2026-05-08 · unverdicted · novelty 5.0

Training a mean-field Transformer under L2 regularization induces an escape from attention-driven token clustering in later layers after initial clustering.

Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models

cs.CL · 2026-05-12 · unverdicted · novelty 4.0

Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.

How Language Models Process Negation

cs.CL · 2026-05-04

citing papers explorer

Local Linearity of LLMs Enables Activation Steering via Model-Based Linear Optimal Control cs.LG · 2026-04-21 · conditional · none · ref 67
Local linearity of LLM layers enables LQR-based closed-loop activation steering with theoretical tracking guarantees.
Interpretability Can Be Actionable cs.LG · 2026-05-11 · conditional · none · ref 111
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
Training-Induced Escape from Token Clustering in a Mean-Field Formulation of Transformers cs.LG · 2026-05-08 · unverdicted · none · ref 62
Training a mean-field Transformer under L2 regularization induces an escape from attention-driven token clustering in later layers after initial clustering.
Qwen-Scope: Turning Sparse Features into Development Tools for Large Language Models cs.CL · 2026-05-12 · unverdicted · none · ref 64
Qwen-Scope provides open-source sparse autoencoders for Qwen models that function as practical interfaces for steering, evaluating, data workflows, and optimizing large language models.
How Language Models Process Negation cs.CL · 2026-05-04 · unreviewed · ref 13
