Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

2026 · cs.CL · arXiv 2601.21996

1 Pith paper cites this work. Polarity classification is still indexing.

abstract

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.
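
To make the influence-function idea in the abstract concrete, here is a minimal sketch of the classical estimate (the gradient of a target, times the inverse Hessian, times a per-sample training gradient). It is not the MDA implementation, whose details are not given on this page. It swaps the LLM for a tiny ridge regression so the Hessian is exact, and the names `influence_scores`, `grad_target`, and the "unit score" (here simply the first parameter) are hypothetical stand-ins for illustration.

```python
import numpy as np

def influence_scores(X, y, theta, grad_target, damping=1e-3):
    """Classical influence-function estimate for a ridge least-squares model.

    For each training row i, returns the estimated change in a scalar target
    f(theta) per unit of upweighting that row: -grad_f^T H^{-1} grad_L_i.
    """
    n, d = X.shape
    residual = X @ theta - y                    # shape (n,)
    per_sample_grads = residual[:, None] * X    # grad of 0.5 * (x.theta - y)^2
    hessian = X.T @ X / n + damping * np.eye(d)
    # Solve H v = grad_f once, then reuse v for every training sample.
    v = np.linalg.solve(hessian, grad_target)
    return -per_sample_grads @ v

# Toy data (assumption: a 5-parameter linear model stands in for the LLM).
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5))
true_theta = np.array([1.0, -2.0, 0.5, 0.0, 3.0])
y = X @ true_theta + 0.1 * rng.normal(size=200)

# Fit the ridge solution so that theta is a stationary point of the objective.
theta = np.linalg.solve(X.T @ X / 200 + 1e-3 * np.eye(5), X.T @ y / 200)

# Hypothetical "interpretable unit" score: the first coordinate of theta.
grad_target = np.zeros(5)
grad_target[0] = 1.0

scores = influence_scores(X, y, theta, grad_target)
top_k = np.argsort(-np.abs(scores))[:10]        # highest-influence samples
```

In this toy setting, `top_k` plays the role of the "small fraction of high-influence samples" that a targeted intervention would remove or augment. At LLM scale the Hessian cannot be formed explicitly, so an approximation would be needed in place of the exact solve.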

representative citing papers

Symbolic Mechanistic Data Attribution: Tracing Training Influence to Learned Behavioral Policies

cs.LG · 2026-06-28 · unverdicted · novelty 7.0 · ref 45

SMDA fits ridge regression on SAE features to distill symbolic policies, then decomposes each SFT example's influence via feature-activation and output-probability deltas, demonstrated on refusal behavior in Llama-3.2-3B-Instruct.
