Standard circuit discovery methods produce dataset-specific circuits rather than task-general ones, and a new clustering-based method discovers multiple, more faithful circuits per dataset.
7 Pith papers cite this work. Polarity classification is still indexing.
Citation roles: background (2)
Citation polarities: background (2)
Representative citing papers
Circuit-based metrics from Vision Transformer internals provide better label-free proxies for generalization under distribution shift than existing methods like model confidence.
MoE expert neurons show lower polysemanticity than dense FFN neurons, a gap that widens with sparser routing, and experts specialize in fine-grained tasks such as specific linguistic operations, supporting expert-level interpretability.
A new framework using Task Subspace Logit Attribution localizes attention heads specialized for task recognition and task learning in in-context learning, showing they align and rotate hidden states within a task subspace.
Sparse autoencoders plus greedy filtering and factorization-machine interaction modeling identify minimal sets of features in Gemma-2-2B-IT and LLaMA-3.1-8B-IT whose ablation produces jailbreaks by flipping refusal to compliance.
Per-head attention contributions to the residual stream serve as strong linear features for classifying relational knowledge in LLMs, with probe accuracy correlating to relation specificity and signal distribution.
Data-driven Circuit Discovery for Interpretability of Language Models