Attribution-Guided Pruning for Insight and Control: Circuit Discovery and Targeted Correction in Small-scale LLMs

· 2025 · cs.LG · arXiv 2506.13727

1 Pith paper cites this work. Polarity classification is still indexing.


abstract

Large Language Models (LLMs) are widely deployed in real-world applications, yet their internal mechanisms remain difficult to interpret and control, limiting our ability to diagnose and correct undesirable behaviors. Mechanistic interpretability addresses this challenge by identifying circuits -- subsets of model components responsible for specific behaviors. However, discovering such circuits in LLMs remains difficult due to their scale and complexity. We frame circuit discovery as identifying parameters that contribute most to model outputs on task-specific inputs, and use Layer-wise Relevance Propagation (LRP) with reference samples to attribute and extract these components via pruning. Building on this, we introduce contrastive relevance to isolate circuits associated with undesired behaviors while preserving general capabilities, enabling targeted model correction. On OPT-125M, we show that pruning as little as ~0.3% of neurons substantially reduces toxic outputs, while pruning approximately 0.03% of weight elements mitigates repetitive text generation without degrading general performance. These results establish attribution-guided pruning as an effective mechanism for identifying and intervening on behavior-specific circuits in LLMs. We further validate our findings on additional small-scale language models, demonstrating that the proposed approach transfers across architectures. Our code is publicly available at https://github.com/erfanhatefi/SparC3.
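The contrastive selection step described in the abstract can be illustrated with a minimal sketch. It assumes per-neuron relevance scores have already been computed (for example with LRP) on two sets of reference samples, one exhibiting the undesired behavior and one covering general text. The scoring rule used here (the difference of L1-normalized relevances) and all function names are illustrative assumptions, not the paper's definition or the SparC3 API.

```python
from typing import List


def contrastive_scores(behavior_rel: List[float], general_rel: List[float]) -> List[float]:
    """Assumed rule: normalized behavior relevance minus normalized general relevance."""
    def normalize(xs: List[float]) -> List[float]:
        total = sum(abs(x) for x in xs)
        return [x / total for x in xs] if total > 0 else [0.0] * len(xs)

    b = normalize(behavior_rel)
    g = normalize(general_rel)
    return [bi - gi for bi, gi in zip(b, g)]


def select_neurons_to_prune(scores: List[float], fraction: float) -> List[int]:
    """Return indices of the highest-scoring neurons, up to the given fraction."""
    k = max(1, round(fraction * len(scores)))
    ranked = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(ranked[:k])


def apply_mask(activations: List[float], pruned: List[int]) -> List[float]:
    """Zero out the activations of pruned neurons; leave the rest unchanged."""
    pruned_set = set(pruned)
    return [0.0 if i in pruned_set else a for i, a in enumerate(activations)]


# Toy relevance values (made up for illustration, not taken from the paper).
behavior_rel = [0.1, 0.9, 0.2, 0.8]   # relevance on undesired-behavior samples
general_rel = [0.5, 0.1, 0.3, 0.1]    # relevance on general-purpose samples

scores = contrastive_scores(behavior_rel, general_rel)
to_prune = select_neurons_to_prune(scores, fraction=0.5)
masked = apply_mask([1.0, 2.0, 3.0, 4.0], to_prune)
```

In this toy case, neurons 1 and 3 are relevant mainly to the undesired behavior and are selected, while neurons 0 and 2, which matter more for general text, are kept. In practice the fraction would be far smaller, in line with the ~0.3% of neurons reported for OPT-125M.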

representative citing papers

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models

cs.CL · 2026-04-27 · unverdicted · novelty 6.0

CAP scores attention heads via interventional masking on reasoning calibration data and converts those scores into weight pruning decisions, reporting up to 61% relative accuracy gains over Wanda at 20% sparsity on ARC-Challenge.

citing papers explorer

Showing 1 of 1 citing paper.

Pruning via Causal Attribution Preserves Reasoning Performance in Large Language Models cs.CL · 2026-04-27 · unverdicted · none · ref 5 · internal anchor
