ICML 2024 Workshop on Mechanistic Interpretability
Three Pith papers cite this work. Polarity classification is still being indexed.
Citing papers by year: 2026 (3). Verdicts: unverdicted (3).
Citing papers:
- From Circuit Evidence to Mechanistic Theory: An Inductive Logic Approach
  Introduces Causal Functional Signatures, grounded in causal evidence, together with architectural signatures learned by inductive logic programming (ILP), to make mechanistic claims explicit, comparable, and portable across model scales.
- GKnow: Measuring the Entanglement of Gender Bias and Factual Gender
  Gender bias and factual gender knowledge are severely entangled in language model circuits and neurons, which makes neuron ablation an unreliable method for debiasing.
- How Much Do Circuits Tell Us? Measuring the Consistency and Specificity of Language Model Circuits
  Language model circuits show high within-task consistency and necessity but substantial overlap across tasks, which makes them less task-specific than commonly assumed.