Dominant error in attribution patching arises from downstream non-linearities; a single HVP correction removes the leading error term and matches Integrated Gradients accuracy at lower cost across 124M-9B models.
From robustness to improved generalization and calibration in pre-trained language models. Transactions of the Association for Computational Linguistics, 13:264–280, 2025.
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.LG (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers
When Attribution Patching Lies: Diagnosis and a Second-Order Correction