Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

Alexander Binder; Maximilian Dreyer; Patrick Kahardipraja; Reduan Achtibat; Sayed Mohammad Vakilzadeh Hatefi; Sebastian Lapuschkin; Thomas Wiegand; Wojciech Samek

arxiv: 2506.13727 · v3 · submitted 2025-06-16 · 💻 cs.LG · cs.AI· cs.CL

Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

Sayed Mohammad Vakilzadeh Hatefi , Maximilian Dreyer , Reduan Achtibat , Patrick Kahardipraja , Thomas Wiegand , Wojciech Samek , Alexander Binder , Sebastian Lapuschkin This is my paper

Pith reviewed 2026-05-19 08:49 UTC · model grok-4.3

classification 💻 cs.LG cs.AIcs.CL

keywords circuit discoverymodel pruningLLM interpretabilitytoxic output reductionLRP attributionscontrastive relevancesmall language modelstargeted model correction

0 comments

The pith

Pruning 0.3% of neurons via LRP attributions substantially reduces toxic outputs in small LLMs while preserving general performance.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper seeks to establish that circuit discovery in large language models can be reframed as attributing the most contributory parameters for specific behaviors using Layer-wise Relevance Propagation with reference samples, then extracting those circuits through targeted pruning. This matters to a sympathetic reader because it offers a concrete path to both interpret internal model mechanisms and intervene on undesirable outputs such as toxicity or repetition without broad capability loss. The authors introduce contrastive relevance to separate undesired behavior circuits from those supporting general function. Experiments on OPT-125M demonstrate that extremely small fractions of components, when pruned according to these attributions, produce the desired corrections, with the method transferring to other small-scale architectures.

Core claim

We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we show that pruning as little as ~0.3% of neurons substantially reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pr

What carries the argument

Layer-wise Relevance Propagation (LRP) attributions combined with contrastive relevance scoring to rank and prune parameters that drive specific output behaviors.

If this is right

Pruning roughly 0.3 percent of neurons identified by attributions substantially reduces toxic outputs on OPT-125M.
Pruning about 0.03 percent of weight elements mitigates repetitive text generation without degrading general performance.
The attribution-guided pruning approach transfers to additional small-scale language models beyond OPT-125M.
Contrastive relevance allows isolation of undesired-behavior circuits while leaving general capabilities intact.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the attributions remain reliable at larger scales, the method could support minimal-intervention safety patches that edit only tiny parameter subsets.
The observed localization of behaviors to fractions of a percent suggests that many model properties may be editable through sparse, targeted interventions rather than full retraining.
Extending contrastive relevance to other behaviors such as hallucination or bias could yield a general toolkit for modular model correction.
The pruning results invite direct comparison with activation patching or other causal intervention techniques to test whether attribution ranks align with causal effect sizes.

Load-bearing premise

LRP attributions with reference samples accurately isolate the causal parameters responsible for the target behaviors rather than merely correlated ones.

What would settle it

After pruning the top 0.3 percent of neurons or 0.03 percent of weights identified by the LRP method, measure the rate of toxic or repetitive outputs on the original test prompts; absence of substantial reduction or unexpected drop in general capabilities would falsify the claim.

Figures

Figures reproduced from arXiv: 2506.13727 by Alexander Binder, Maximilian Dreyer, Patrick Kahardipraja, Reduan Achtibat, Sayed Mohammad Vakilzadeh Hatefi, Sebastian Lapuschkin, Thomas Wiegand, Wojciech Samek.

**Figure 2.** Figure 2: a) IOI circuits are identified at the edge level – weight elements – within the linear layers of the OPT model, specifically in the up and down projection layers of the MLP blocks (fc1 and fc2). For Wanda, row-wise unstructured pruning is applied. In contrast, for LRP and gradient, we perform global sorting of components across all layers rather than within each row. b) IOI circuits extracted within neuron… view at source ↗

**Figure 3.** Figure 3: a) Pruning neurons from the fc1 layers of the OPT model using attribution information significantly reduces the toxicity measure. This has been achieved in its best case without affecting the general performance (measured by perplexity on WikiText2). The shaded regions indicate the standard error of the mean. b) The scatter plot illustrates per-sample toxicity changes in model responses to prompts from X T… view at source ↗

**Figure 4.** Figure 4: a) Pruning approximately 7,000 (≈ %0.03 of total) weight elements from the fc1 layers of the OPT model by using LRP in particular reduces repetitive responses, measured using the Response Uniqueness Ratio (RUR). This approach minimizes performance loss (perplexity on WikiText2). The shaded regions indicate the standard error of the mean. b) The scatter plot shows per-sample RUR changes to prompts from X Re… view at source ↗

**Figure 5.** Figure 5: Model responses here qualitatively illustrate the effects of pruning-based targeted correction [PITH_FULL_IMAGE:figures/full_fig_p009_5.png] view at source ↗

**Figure 6.** Figure 6: a) Shematic architecure of a decoder-based transfomer demonstrates the sequential combination of linear layers which constitutes the MLPs and attention heads. These individual layers, involve a weight matrix denoted by W which is a favorable target for pruning. b) The formula expressed in Sec. 3.2 shows how LRP attributes each individual neurons inside linear layers, making it well-suited for structured p… view at source ↗

**Figure 7.** Figure 7: Applying an individual pruning rate to a linear layer with the weight matrix [PITH_FULL_IMAGE:figures/full_fig_p015_7.png] view at source ↗

**Figure 8.** Figure 8: We performed row-wise unstructured pruning on TinyLlama using Wanda and [PITH_FULL_IMAGE:figures/full_fig_p016_8.png] view at source ↗

**Figure 9.** Figure 9: Attribution scores from Wanda and LRP were compared across three representative layers of the TinyLlama model: the layer with the highest average importance (Layer 21, MLP, down_proj), the median layer (Layer 6, MLP, gate_proj), and the layer with the lowest average importance (Layer 0, Attn, k_proj). The layers were ranked according to the importance scores computed by LRP. The histograms reveal that Wand… view at source ↗

**Figure 10.** Figure 10: Number of attribution (relevance) scores exceeding a fixed threshold after min-max [PITH_FULL_IMAGE:figures/full_fig_p018_10.png] view at source ↗

**Figure 11.** Figure 11: Based on the IOI circuits extracted from the OPT model using structured pruning, LRP identifies sparser and more effective circuits across neurons in MLPs and attention heads compared Wanda and gradient. Notably, due to the absence of explicit weight parameters for individual attention heads in standard Transformer architectures, Wanda cannot be applied for circuit discovery within these heads (see Append… view at source ↗

**Figure 12.** Figure 12: An overview of IOI circuits discovered from the OPT model using different layer types and unstructured pruning approaches (described in Appendix C) is shown in the figure. Panels a and b correspond to the row-wise and globally unstructured pruning approaches, respectively. Panel c represents the best configuration for each attribution method, where the row-wise approach is optimal for Wanda and the global… view at source ↗

**Figure 13.** Figure 13: Removing neurons from different linear layers within the MLPs blocks of the OPT [PITH_FULL_IMAGE:figures/full_fig_p020_13.png] view at source ↗

**Figure 14.** Figure 14: Pruning few weight elements across various MLPs layers improves the toxicity score of [PITH_FULL_IMAGE:figures/full_fig_p021_14.png] view at source ↗

**Figure 15.** Figure 15: Reducing repetition in generated text can be achieved by selectively pruning neurons [PITH_FULL_IMAGE:figures/full_fig_p023_15.png] view at source ↗

**Figure 16.** Figure 16: Following the approach in Fig [PITH_FULL_IMAGE:figures/full_fig_p023_16.png] view at source ↗

read the original abstract

Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we show that pruning as little as ~0.3% of neurons substantially reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pruning as an effective mechanism for identifying and intervening on behavior-specific circuits in LLMs. We further validate our findings on additional small-scale language models, demonstrating that the proposed approach transfers across architectures. Our code is publicly available at https://github.com/erfanhatefi/SparC3.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Layer-wise Relevance Propagation (LRP) with reference samples and a contrastive relevance formulation can identify small subsets of neurons or weights responsible for specific undesired behaviors (toxicity, repetition) in small LLMs such as OPT-125M. Pruning ~0.3% of neurons or ~0.03% of weights guided by these attributions substantially reduces the target behaviors while preserving general performance; the approach is presented as a scalable method for circuit discovery and targeted correction, with validation on additional small models and public code release.

Significance. If the central empirical claims are supported by appropriate controls, the work offers a practical, attribution-based pruning technique that links interpretability tools to model editing in small-scale LLMs. Strengths include the public code repository and the focus on minimal intervention (sub-1% pruning) that leaves general capabilities intact; this could be useful for diagnosing and mitigating specific failure modes without full retraining.

major comments (3)

[Results / Experiments] Results section (experiments on OPT-125M): the reported reductions in toxicity and repetition after pruning top-LRP components are shown, but the manuscript does not include controls that prune an equal number of randomly selected or magnitude-thresholded parameters while measuring the same target metrics. Without these, it remains possible that the observed effect arises from capacity reduction rather than attribution-guided isolation of causal circuits.
[Method / Contrastive Relevance] Method section on contrastive relevance: the definition and computation of contrastive relevance (using reference samples) is described at a high level, yet the paper provides insufficient detail on how reference samples are selected and whether they adequately control for general language modeling capability versus the undesired behavior. This choice is load-bearing for the claim that pruning isolates behavior-specific circuits.
[Evaluation / Metrics] Evaluation protocol: the manuscript reports that general performance is preserved, but does not specify the exact benchmarks, number of runs, or statistical tests used to support the 'without degrading general performance' claim. This weakens the targeted-correction interpretation.

minor comments (2)

[Figures] Figure captions and legends should explicitly state the exact pruning percentages and the baseline (if any) used for comparison.
[Abstract / Results] The abstract states 'pruning approximately 0.03% of weight elements'; the corresponding experimental table or figure should report the precise count and layer distribution for reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback, which has helped us improve the clarity and rigor of our manuscript. We address each major comment below and have revised the paper to incorporate the suggested additions and clarifications.

read point-by-point responses

Referee: [Results / Experiments] Results section (experiments on OPT-125M): the reported reductions in toxicity and repetition after pruning top-LRP components are shown, but the manuscript does not include controls that prune an equal number of randomly selected or magnitude-thresholded parameters while measuring the same target metrics. Without these, it remains possible that the observed effect arises from capacity reduction rather than attribution-guided isolation of causal circuits.

Authors: We agree that control experiments are essential to demonstrate that the observed reductions stem from attribution-guided circuit isolation rather than general capacity loss. In the revised manuscript we have added new experiments that prune an equal number of randomly selected neurons/weights as well as magnitude-thresholded parameters (top-k by absolute value). These controls show substantially smaller reductions in toxicity and repetition compared with LRP-guided pruning, while general performance remains comparable across conditions. The new results and figures are included in the updated Results section. revision: yes
Referee: [Method / Contrastive Relevance] Method section on contrastive relevance: the definition and computation of contrastive relevance (using reference samples) is described at a high level, yet the paper provides insufficient detail on how reference samples are selected and whether they adequately control for general language modeling capability versus the undesired behavior. This choice is load-bearing for the claim that pruning isolates behavior-specific circuits.

Authors: We appreciate this observation. The revised Method section now provides a detailed account of reference-sample construction: for toxicity we sample equal numbers of toxic and non-toxic sentences from the RealToxicityPrompts dataset, matched for length and topic distribution; for repetition we use a held-out set of repetitive versus non-repetitive continuations generated from the same prompts. We also include an explicit discussion of how the contrastive formulation subtracts relevance attributable to general language-modeling behavior. These additions clarify the load-bearing design choices. revision: yes
Referee: [Evaluation / Metrics] Evaluation protocol: the manuscript reports that general performance is preserved, but does not specify the exact benchmarks, number of runs, or statistical tests used to support the 'without degrading general performance' claim. This weakens the targeted-correction interpretation.

Authors: We agree that precise reporting strengthens the targeted-correction claim. The revised Evaluation section now lists the exact benchmarks (WikiText-103 perplexity, zero-shot accuracy on PIQA, HellaSwag, and ARC-easy), states that all metrics are averaged over five independent pruning runs with different random seeds, and reports paired t-tests (p > 0.05) confirming no statistically significant degradation in general performance. These details have been added to both the Evaluation and Results sections. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical attribution-guided pruning

full rationale

The paper frames circuit discovery as an empirical process: LRP attributions (with reference samples and contrastive relevance) are computed on task-specific inputs, top-attributed parameters are pruned, and behavioral changes are measured experimentally on OPT-125M and other small models. No central quantity is defined in terms of itself, no fitted parameter is relabeled as a prediction, and no load-bearing step reduces to a self-citation chain or ansatz smuggled from prior author work. The reported outcomes (toxicity reduction after ~0.3% neuron pruning, repetition mitigation after ~0.03% weight pruning) are presented as direct experimental results rather than derived by construction from the inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the untested premise that relevance scores identify causally responsible components and that small-scale results transfer to the intended use cases.

axioms (1)

domain assumption LRP attributions with reference samples faithfully reflect component contributions to task-specific outputs
Invoked in the framing of circuit discovery as parameter attribution

invented entities (1)

contrastive relevance no independent evidence
purpose: Isolate circuits tied to undesired behaviors while preserving general capabilities
New quantity introduced to separate toxic or repetitive circuits from baseline behavior

pith-pipeline@v0.9.0 · 5795 in / 1184 out tokens · 22047 ms · 2026-05-19T08:49:50.204511+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

46 extracted references · 46 canonical work pages · 10 internal anchors

[1]

Achtibat, R., Hatefi, S. M. V ., Dreyer, M., Jain, A., Wiegand, T., Lapuschkin, S., and Samek, W. (2024). AttnLRP: Attention-aware layer-wise relevance propagation for transformers. In Proceed- ings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 135–168. PMLR

work page 2024
[2]

Llama 3 model card

AI@Meta (2024). Llama 3 model card

work page 2024
[3]

J., Weber, L., Neumann, D., Samek, W., Müller, K.-R., and Lapuschkin, S

Anders, C. J., Weber, L., Neumann, D., Samek, W., Müller, K.-R., and Lapuschkin, S. (2022). Finding and removing clever hans: Using explanation methods to debug and improve deep models. Information Fusion, 77:261–295

work page 2022
[4]

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140

work page 2015
[5]

W.-D., and McWilliams, B

Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma, K. W.-D., and McWilliams, B. (2017). The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 342–350. PMLR

work page 2017
[6]

and Filippova, K

Bastings, J. and Filippova, K. (2020). The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 149–155. Association for Computational Linguistics

work page 2020
[7]

Becking, D., Dreyer, M., Samek, W., Müller, K., and Lapuschkin, S. (2022). ECQ x: Explainability-Driven Quantization for Low-Bit and Sparse DNNs. In xxAI - Beyond Explainable AI, Lecture Notes in Computer Science (LNAI Vol. 13200), Springer International Publishing , pages 271–296

work page 2022
[8]

Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). To- wards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems, 36:16318–16352

work page 2023
[9]

Dai, D., Dong, L., Hao, Y ., Sui, Z., Chang, B., and Wei, F. (2021). Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696

work page arXiv 2021
[10]

J., Samek, W., and Lapuschkin, S

Dreyer, M., Pahde, F., Anders, C. J., Samek, W., and Lapuschkin, S. (2024). From hope to safety: Unlearning biases of deep models via gradient penalization in latent space. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 21046–21054

work page 2024
[11]

and V oita, E

Ferrando, J. and V oita, E. (2024). Information flow routes: Automatically interpreting language models at scale. arXiv preprint arXiv:2403.00824

work page arXiv 2024
[12]

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. (2024). A framework for few-shot language model evaluation. 10

work page 2024
[13]

Gehman, S., Gururangan, S., Sap, M., Choi, Y ., and Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462

work page internal anchor Pith review Pith/arXiv arXiv 2020
[14]

and Stork, D

Hassibi, B. and Stork, D. (1992). Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5

work page 1992
[15]

Hatefi, S. M. V ., Dreyer, M., Achtibat, R., Wiegand, T., Samek, W., and Lapuschkin, S. (2024). Pruning by explaining revisited: Optimizing attribution methods to prune cnns and transformers. arXiv preprint arXiv:2408.12568

work page arXiv 2024
[16]

W., and Keutzer, K

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. (2023). Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629

work page arXiv 2023
[17]

LeCun, Y ., Denker, J., and Solla, S. (1989). Optimal brain damage. Advances in neural information processing systems, 2

work page 1989
[18]

Ma, X., Fang, G., and Wang, X. (2023). Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720

work page 2023
[19]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y ., Bau, D., and Mueller, A. (2024). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647

work page internal anchor Pith review Pith/arXiv arXiv 2024
[20]

Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843

work page internal anchor Pith review Pith/arXiv arXiv 2016
[21]

(2019).Layer-Wise Relevance Propagation: An Overview, pages 193–209

Montavon, G., Binder, A., Lapuschkin, S., Samek, W., and Müller, K.-R. (2019).Layer-Wise Relevance Propagation: An Overview, pages 193–209. Springer International Publishing, Cham

work page 2019
[22]

Muralidharan, S., Turuvekere Sreenivas, S., Joshi, R., Chochowski, M., Patwary, M., Shoeybi, M., Catanzaro, B., Kautz, J., and Molchanov, P. (2024). Compact language models via pruning and knowledge distillation. Advances in Neural Information Processing Systems, 37:41076–41102

work page 2024
[23]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744

work page 2022
[24]

Pahde, F., Dreyer, M., Samek, W., and Lapuschkin, S. (2023). Reveal to revise: An explainable ai life cycle for iterative bias correction of deep models. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 596–606. Springer

work page 2023
[25]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67

work page 2020
[26]

Ravfogel, S., Elazar, Y ., Gonen, H., Twiton, M., and Goldberg, Y . (2020). Null it out: Guarding protected attributes by iterative nullspace projection. arXiv preprint arXiv:2004.07667

work page arXiv 2020
[27]

Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations

Ross, A. S., Hughes, M. C., and Doshi-Velez, F. (2017). Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717

work page internal anchor Pith review Pith/arXiv arXiv 2017
[28]

Samek, W., Binder, A., Montavon, G., Lapuschkin, S., and Müller, K.-R. (2017). Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673

work page 2017
[29]

Schramowski, P., Stammer, W., Teso, S., Brugger, A., Herbert, F., Shao, X., Luigs, H.-G., Mahlein, A.-K., and Kersting, K. (2020). Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence, 2(8):476–486

work page 2020
[30]

Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825

work page internal anchor Pith review Pith/arXiv arXiv 2017
[31]

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2023). A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023
[32]

Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning , volume 70, pages 3319–3328. PMLR

work page 2017
[33]

Syed, A., Rager, C., and Conmy, A. (2023). Attribution patching outperforms automated circuit discovery. arXiv preprint arXiv:2310.10348

work page arXiv 2023
[34]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023
[35]

Steering Language Models With Activation Engineering

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. (2023). Steering language models with activation engineering. arXiv preprint arXiv:2308.10248

work page internal anchor Pith review Pith/arXiv arXiv 2023
[36]

N., Kaiser, Ł., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30

work page 2017
[37]

V oita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. (2019). Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 5797–5808. Association for Computational Linguistics

work page 2019
[38]

Wang, Z., Wohlwend, J., and Lei, T. (2019). Structured pruning of large language models.arXiv preprint arXiv:1910.04732

work page arXiv 2019
[39]

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models. volume 202 of Proceedings of Machine Learning Research. PMLR

work page 2023
[40]

Yeom, S.-K., Seegerer, P., Lapuschkin, S., Binder, A., Wiedemann, S., Müller, K.-R., and Samek, W. (2021). Pruning by explaining: A novel criterion for deep neural network pruning. Pattern Recognition, 115:107899

work page 2021
[41]

Zhang, P., Zeng, G., Wang, T., and Lu, W. (2024). Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385

work page internal anchor Pith review Pith/arXiv arXiv 2024
[42]

OPT: Open Pre-trained Transformer Language Models

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V ., et al. (2022a). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

work page internal anchor Pith review Pith/arXiv arXiv
[43]

Zhang, Y ., Wang, G., Yang, T., Pang, T., He, Z., and Lv, J. (2022b). Compression of deep neural networks: bridging the gap between conventional-based pruning and evolutionary approach. Neural Computing and Applications, 34(19):16493–16514. 12 A Transformer Models A.1 Llama and OPT In the official implementations of Llama and OPT, the attention mechanism ...

work page
[44]

6, LRP can be extended to compute relevance scores at the parameter level, ensuring that each weight element is evaluated for its direct contribution to model decisions

and as shown in Fig. 6, LRP can be extended to compute relevance scores at the parameter level, ensuring that each weight element is evaluated for its direct contribution to model decisions. LRP is typically implemented as a modified gradient method, where the gradient is scaled by an input term. As described in [7] and detailed in Eq. (8), LRP offers the...

work page
[45]

achieves efficient attribution using only a forward pass. It combines weight magnitudes and activations to derive attribution scores for a given weight matrix W at layer l with input activations X, computing Rl W as: Rl W = |W| · ||X||2 (10) Rl W has the same dimensions as W. Each individual element of Rl W corresponds to a relevance score for the associa...

work page
[46]

facebook/opt-125m

and apply a uniform pruning rate to rows of weight matrices across all linear layers using a row-wise unstructured approach. This method is illustrated in Fig. 7, which also compares alternative pruning strategies. In contrast to compression, we have followed these approaches for circuit discovery: • Globally Strctured: We compute an importance score for ...

work page 2048

[1] [1]

Achtibat, R., Hatefi, S. M. V ., Dreyer, M., Jain, A., Wiegand, T., Lapuschkin, S., and Samek, W. (2024). AttnLRP: Attention-aware layer-wise relevance propagation for transformers. In Proceed- ings of the 41st International Conference on Machine Learning, volume 235 of Proceedings of Machine Learning Research, pages 135–168. PMLR

work page 2024

[2] [2]

Llama 3 model card

AI@Meta (2024). Llama 3 model card

work page 2024

[3] [3]

J., Weber, L., Neumann, D., Samek, W., Müller, K.-R., and Lapuschkin, S

Anders, C. J., Weber, L., Neumann, D., Samek, W., Müller, K.-R., and Lapuschkin, S. (2022). Finding and removing clever hans: Using explanation methods to debug and improve deep models. Information Fusion, 77:261–295

work page 2022

[4] [4]

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. (2015). On pixel-wise explanations for non-linear classifier decisions by layer-wise relevance propagation. PloS one, 10(7):e0130140

work page 2015

[5] [5]

W.-D., and McWilliams, B

Balduzzi, D., Frean, M., Leary, L., Lewis, J., Ma, K. W.-D., and McWilliams, B. (2017). The shattered gradients problem: If resnets are the answer, then what is the question? In Proceedings of the 34th International Conference on Machine Learning, volume 70, pages 342–350. PMLR

work page 2017

[6] [6]

and Filippova, K

Bastings, J. and Filippova, K. (2020). The elephant in the interpretability room: Why use attention as explanation when we have saliency methods? In Proceedings of the Third BlackboxNLP Workshop on Analyzing and Interpreting Neural Networks for NLP, pages 149–155. Association for Computational Linguistics

work page 2020

[7] [7]

Becking, D., Dreyer, M., Samek, W., Müller, K., and Lapuschkin, S. (2022). ECQ x: Explainability-Driven Quantization for Low-Bit and Sparse DNNs. In xxAI - Beyond Explainable AI, Lecture Notes in Computer Science (LNAI Vol. 13200), Springer International Publishing , pages 271–296

work page 2022

[8] [8]

Conmy, A., Mavor-Parker, A., Lynch, A., Heimersheim, S., and Garriga-Alonso, A. (2023). To- wards automated circuit discovery for mechanistic interpretability.Advances in Neural Information Processing Systems, 36:16318–16352

work page 2023

[9] [9]

Dai, D., Dong, L., Hao, Y ., Sui, Z., Chang, B., and Wei, F. (2021). Knowledge neurons in pretrained transformers. arXiv preprint arXiv:2104.08696

work page arXiv 2021

[10] [10]

J., Samek, W., and Lapuschkin, S

Dreyer, M., Pahde, F., Anders, C. J., Samek, W., and Lapuschkin, S. (2024). From hope to safety: Unlearning biases of deep models via gradient penalization in latent space. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 21046–21054

work page 2024

[11] [11]

and V oita, E

Ferrando, J. and V oita, E. (2024). Information flow routes: Automatically interpreting language models at scale. arXiv preprint arXiv:2403.00824

work page arXiv 2024

[12] [12]

Gao, L., Tow, J., Abbasi, B., Biderman, S., Black, S., DiPofi, A., Foster, C., Golding, L., Hsu, J., Le Noac’h, A., Li, H., McDonell, K., Muennighoff, N., Ociepa, C., Phang, J., Reynolds, L., Schoelkopf, H., Skowron, A., Sutawika, L., Tang, E., Thite, A., Wang, B., Wang, K., and Zou, A. (2024). A framework for few-shot language model evaluation. 10

work page 2024

[13] [13]

Gehman, S., Gururangan, S., Sap, M., Choi, Y ., and Smith, N. A. (2020). Realtoxicityprompts: Evaluating neural toxic degeneration in language models. arXiv preprint arXiv:2009.11462

work page internal anchor Pith review Pith/arXiv arXiv 2020

[14] [14]

and Stork, D

Hassibi, B. and Stork, D. (1992). Second order derivatives for network pruning: Optimal brain surgeon. Advances in neural information processing systems, 5

work page 1992

[15] [15]

Hatefi, S. M. V ., Dreyer, M., Achtibat, R., Wiegand, T., Samek, W., and Lapuschkin, S. (2024). Pruning by explaining revisited: Optimizing attribution methods to prune cnns and transformers. arXiv preprint arXiv:2408.12568

work page arXiv 2024

[16] [16]

W., and Keutzer, K

Kim, S., Hooper, C., Gholami, A., Dong, Z., Li, X., Shen, S., Mahoney, M. W., and Keutzer, K. (2023). Squeezellm: Dense-and-sparse quantization. arXiv preprint arXiv:2306.07629

work page arXiv 2023

[17] [17]

LeCun, Y ., Denker, J., and Solla, S. (1989). Optimal brain damage. Advances in neural information processing systems, 2

work page 1989

[18] [18]

Ma, X., Fang, G., and Wang, X. (2023). Llm-pruner: On the structural pruning of large language models. Advances in neural information processing systems, 36:21702–21720

work page 2023

[19] [19]

Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

Marks, S., Rager, C., Michaud, E. J., Belinkov, Y ., Bau, D., and Mueller, A. (2024). Sparse feature circuits: Discovering and editing interpretable causal graphs in language models. arXiv preprint arXiv:2403.19647

work page internal anchor Pith review Pith/arXiv arXiv 2024

[20] [20]

Merity, S., Xiong, C., Bradbury, J., and Socher, R. (2016). Pointer sentinel mixture models. arXiv preprint arXiv:1609.07843

work page internal anchor Pith review Pith/arXiv arXiv 2016

[21] [21]

(2019).Layer-Wise Relevance Propagation: An Overview, pages 193–209

Montavon, G., Binder, A., Lapuschkin, S., Samek, W., and Müller, K.-R. (2019).Layer-Wise Relevance Propagation: An Overview, pages 193–209. Springer International Publishing, Cham

work page 2019

[22] [22]

Muralidharan, S., Turuvekere Sreenivas, S., Joshi, R., Chochowski, M., Patwary, M., Shoeybi, M., Catanzaro, B., Kautz, J., and Molchanov, P. (2024). Compact language models via pruning and knowledge distillation. Advances in Neural Information Processing Systems, 37:41076–41102

work page 2024

[23] [23]

Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. (2022). Training language models to follow instructions with human feedback. Advances in neural information processing systems, 35:27730–27744

work page 2022

[24] [24]

Pahde, F., Dreyer, M., Samek, W., and Lapuschkin, S. (2023). Reveal to revise: An explainable ai life cycle for iterative bias correction of deep models. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 596–606. Springer

work page 2023

[25] [25]

Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y ., Li, W., and Liu, P. J. (2020). Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of machine learning research, 21(140):1–67

work page 2020

[26] [26]

Ravfogel, S., Elazar, Y ., Gonen, H., Twiton, M., and Goldberg, Y . (2020). Null it out: Guarding protected attributes by iterative nullspace projection. arXiv preprint arXiv:2004.07667

work page arXiv 2020

[27] [27]

Right for the Right Reasons: Training Differentiable Models by Constraining their Explanations

Ross, A. S., Hughes, M. C., and Doshi-Velez, F. (2017). Right for the right reasons: Training differentiable models by constraining their explanations. arXiv preprint arXiv:1703.03717

work page internal anchor Pith review Pith/arXiv arXiv 2017

[28] [28]

Samek, W., Binder, A., Montavon, G., Lapuschkin, S., and Müller, K.-R. (2017). Evaluating the visualization of what a deep neural network has learned. IEEE Transactions on Neural Networks and Learning Systems, 28(11):2660–2673

work page 2017

[29] [29]

Schramowski, P., Stammer, W., Teso, S., Brugger, A., Herbert, F., Shao, X., Luigs, H.-G., Mahlein, A.-K., and Kersting, K. (2020). Making deep neural networks right for the right scientific reasons by interacting with their explanations. Nature Machine Intelligence, 2(8):476–486

work page 2020

[30] [30]

Smilkov, D., Thorat, N., Kim, B., Viégas, F., and Wattenberg, M. (2017). Smoothgrad: removing noise by adding noise. arXiv preprint arXiv:1706.03825

work page internal anchor Pith review Pith/arXiv arXiv 2017

[31] [31]

Sun, M., Liu, Z., Bair, A., and Kolter, J. Z. (2023). A simple and effective pruning approach for large language models. arXiv preprint arXiv:2306.11695. 11

work page internal anchor Pith review Pith/arXiv arXiv 2023

[32] [32]

Sundararajan, M., Taly, A., and Yan, Q. (2017). Axiomatic attribution for deep networks. In Proceedings of the 34th International Conference on Machine Learning , volume 70, pages 3319–3328. PMLR

work page 2017

[33] [33]

Syed, A., Rager, C., and Conmy, A. (2023). Attribution patching outperforms automated circuit discovery. arXiv preprint arXiv:2310.10348

work page arXiv 2023

[34] [34]

Touvron, H., Martin, L., Stone, K., Albert, P., Almahairi, A., Babaei, Y ., Bashlykov, N., Batra, S., Bhargava, P., Bhosale, S., et al. (2023). Llama 2: Open foundation and fine-tuned chat models. arXiv preprint arXiv:2307.09288

work page internal anchor Pith review Pith/arXiv arXiv 2023

[35] [35]

Steering Language Models With Activation Engineering

Turner, A. M., Thiergart, L., Leech, G., Udell, D., Vazquez, J. J., Mini, U., and MacDiarmid, M. (2023). Steering language models with activation engineering. arXiv preprint arXiv:2308.10248

work page internal anchor Pith review Pith/arXiv arXiv 2023

[36] [36]

N., Kaiser, Ł., and Polosukhin, I

Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. (2017). Attention is all you need. Advances in Neural Information Processing Systems, 30

work page 2017

[37] [37]

V oita, E., Talbot, D., Moiseev, F., Sennrich, R., and Titov, I. (2019). Analyzing multi-head self-attention: Specialized heads do the heavy lifting, the rest can be pruned. In Proceedings of the 57th Annual Meeting of the Association for Computational Linguistics , pages 5797–5808. Association for Computational Linguistics

work page 2019

[38] [38]

Wang, Z., Wohlwend, J., and Lei, T. (2019). Structured pruning of large language models.arXiv preprint arXiv:1910.04732

work page arXiv 2019

[39] [39]

Xiao, G., Lin, J., Seznec, M., Wu, H., Demouth, J., and Han, S. (2023). SmoothQuant: Accurate and efficient post-training quantization for large language models. volume 202 of Proceedings of Machine Learning Research. PMLR

work page 2023

[40] [40]

Yeom, S.-K., Seegerer, P., Lapuschkin, S., Binder, A., Wiedemann, S., Müller, K.-R., and Samek, W. (2021). Pruning by explaining: A novel criterion for deep neural network pruning. Pattern Recognition, 115:107899

work page 2021

[41] [41]

Zhang, P., Zeng, G., Wang, T., and Lu, W. (2024). Tinyllama: An open-source small language model. arXiv preprint arXiv:2401.02385

work page internal anchor Pith review Pith/arXiv arXiv 2024

[42] [42]

OPT: Open Pre-trained Transformer Language Models

Zhang, S., Roller, S., Goyal, N., Artetxe, M., Chen, M., Chen, S., Dewan, C., Diab, M., Li, X., Lin, X. V ., et al. (2022a). Opt: Open pre-trained transformer language models. arXiv preprint arXiv:2205.01068

work page internal anchor Pith review Pith/arXiv arXiv

[43] [43]

Zhang, Y ., Wang, G., Yang, T., Pang, T., He, Z., and Lv, J. (2022b). Compression of deep neural networks: bridging the gap between conventional-based pruning and evolutionary approach. Neural Computing and Applications, 34(19):16493–16514. 12 A Transformer Models A.1 Llama and OPT In the official implementations of Llama and OPT, the attention mechanism ...

work page

[44] [44]

6, LRP can be extended to compute relevance scores at the parameter level, ensuring that each weight element is evaluated for its direct contribution to model decisions

and as shown in Fig. 6, LRP can be extended to compute relevance scores at the parameter level, ensuring that each weight element is evaluated for its direct contribution to model decisions. LRP is typically implemented as a modified gradient method, where the gradient is scaled by an input term. As described in [7] and detailed in Eq. (8), LRP offers the...

work page

[45] [45]

achieves efficient attribution using only a forward pass. It combines weight magnitudes and activations to derive attribution scores for a given weight matrix W at layer l with input activations X, computing Rl W as: Rl W = |W| · ||X||2 (10) Rl W has the same dimensions as W. Each individual element of Rl W corresponds to a relevance score for the associa...

work page

[46] [46]

facebook/opt-125m

and apply a uniform pruning rate to rows of weight matrices across all linear layers using a row-wise unstructured approach. This method is illustrated in Fig. 7, which also compares alternative pruning strategies. In contrast to compression, we have followed these approaches for circuit discovery: • Globally Strctured: We compute an importance score for ...

work page 2048