pith. sign in

arxiv: 2601.21996 · v2 · pith:Y7PEQKMHnew · submitted 2026-01-29 · 💻 cs.CL · cs.AI· cs.LG

Mechanistic Data Attribution: Tracing the Training Origins of Interpretable LLM Units

classification 💻 cs.CL cs.AIcs.LG
keywords datamechanisticinterpretabletrainingattributioncausalheadsinduction
0
0 comments X
read the original abstract

While Mechanistic Interpretability has identified interpretable circuits in LLMs, their causal origins in training data remain elusive. We introduce Mechanistic Data Attribution (MDA), a scalable framework that employs Influence Functions to trace interpretable units back to specific training samples. Through extensive experiments on the Pythia family, we causally validate that targeted intervention--removing or augmenting a small fraction of high-influence samples--significantly modulates the emergence of interpretable heads, whereas random interventions show no effect. Our analysis reveals that repetitive structural data (e.g., LaTeX, XML) acts as a mechanistic catalyst. Furthermore, we observe that interventions targeting induction head formation induce a concurrent change in the model's in-context learning (ICL) capability. This provides direct causal evidence for the long-standing hypothesis regarding the functional link between induction heads and ICL. Finally, we propose a mechanistic data augmentation pipeline that consistently accelerates circuit convergence across model scales, providing a principled methodology for steering the developmental trajectories of LLMs.

This paper has not been read by Pith yet.

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GIF: Locally Sound Geometric Information Flow Control for LLMs

    cs.AI 2026-06 unverdicted novelty 6.0

    GIF introduces a Jacobian-based upper bound on input-output mutual information in LLMs with formal Lean proof and strong empirical recall on injection and leakage benchmarks.