Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones; a new clustering-based method discovers multiple, more faithful circuits per dataset.
14 Pith papers cite this work.
representative citing papers
LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.
Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.
Contribution Weights combine attention, value magnitude, and directional alignment to measure token influence more faithfully than attention alone, and show attention sinks actively suppress information via a convex sink-rate to output-norm relationship.
LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
DMI-Lib delivers 0.4-6.8% overhead for offline batch LLM inference and ~6% for moderate online serving while exposing rich internal signals across backends, cutting latency overhead 2-15x versus prior observability baselines.
Token-level contrastive attribution yields informative signals for some LLM benchmark failures but is not universally applicable across datasets and models.
CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.
Cosine similarity poorly predicts performance degradation from layer removal in LLMs, making direct accuracy-drop ablation a more reliable relevance metric.
ConceptTracer supplies an interactive interface and saliency/selectivity metrics to locate concept-responsive neurons in neural representations, shown on TabPFN.
The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.
Safe-SAIL supplies a pre-explanation metric and segment-level simulation to interpret 1758 safety SAE features across pornography, politics, violence, and terror, with public models and tools released.
citing papers explorer
- Are LLMs Ready for Conflict Monitoring? Empirical Evidence from West Africa
  LLMs show significant biases in conflict event classification, with open-weight models exhibiting false illegitimation and adapted models showing actor bias and lexical sensitivity, making them unsuitable for unsupervised deployment.
- In-Context Learning in Speech Language Models: Analyzing the Role of Acoustic Features, Linguistic Structure, and Induction Heads
  Speech language models show in-context learning where speaking rate affects both accuracy and mimicry, and induction heads are causally necessary for this capability.
- Stories in Space: In-Context Learning Trajectories in Conceptual Belief Space
  LLMs perform in-context learning as trajectories through a structured low-dimensional conceptual belief space, with the structure visible in both behavior and internal representations and causally manipulable via interventions.
- Instructions Shape Production of Language, not Processing
  Instructions trigger a production-centered mechanism in language models, with task-specific information stable in input tokens but varying strongly in output tokens and correlating with behavior.
- Explainable AI: Context-Aware Layer-Wise Integrated Gradients for Explaining Transformer Models
  CA-LIG is a unified hierarchical attribution method that computes layer-wise Integrated Gradients fused with class-specific attention gradients to generate signed, context-sensitive explanations for transformer models.
- Locate, Steer, and Improve: A Practical Survey of Actionable Mechanistic Interpretability in Large Language Models
  The survey organizes mechanistic interpretability techniques into a Locate-Steer-Improve framework to enable actionable improvements in LLM alignment, capability, and efficiency.