Axiomatic Attribution for Deep Networks
Original abstract
We study the problem of attributing the prediction of a deep network to its input features, a problem previously studied by several other works. We identify two fundamental axioms, Sensitivity and Implementation Invariance, that attribution methods ought to satisfy. We show that they are not satisfied by most known attribution methods, which we consider to be a fundamental weakness of those methods. We use the axioms to guide the design of a new attribution method called Integrated Gradients. Our method requires no modification to the original network and is extremely simple to implement; it just needs a few calls to the standard gradient operator. We apply this method to a couple of image models, a couple of text models and a chemistry model, demonstrating its ability to debug networks, to extract rules from a network, and to enable users to engage with models better.
This paper has not been read by Pith yet.
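The abstract notes that the method needs only a few calls to the standard gradient operator: Integrated Gradients scales the difference between an input and a baseline by the average gradient along the straight-line path between them, IG_i(x) = (x_i - x'_i) * integral over alpha in [0, 1] of dF(x' + alpha * (x - x')) / dx_i. The sketch below illustrates the usual Riemann-sum approximation of that integral; it is not the paper's reference code, and the grad_fn callable, the step count, and the NumPy types are placeholder assumptions.

import numpy as np

def integrated_gradients(grad_fn, x, baseline, steps=50):
    # Riemann-sum approximation of the path integral of gradients
    # along the straight line from `baseline` to `x`.
    # `grad_fn(z)` is a hypothetical callable returning dF/dz as an
    # array with the same shape as `z`.
    alphas = np.linspace(0.0, 1.0, steps + 1)[1:]   # interpolation points, skipping alpha = 0
    path_grads = np.stack([grad_fn(baseline + a * (x - baseline)) for a in alphas])
    avg_grad = path_grads.mean(axis=0)              # average gradient along the path
    return (x - baseline) * avg_grad                # scale by the input difference

As a quick sanity check, for a linear model F(x) = w.x with a zero baseline the sketch returns exactly w * x, and as steps grows the attributions sum to F(x) - F(baseline), the completeness property the paper derives.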
Forward citations
Cited by 10 Pith papers
- Architecture-Aware Explanation Auditing for Industrial Visual Inspection
  Explanation faithfulness for deep classifiers on wafer maps is highest when the explainer matches the model's native readout structure, with ViT-Tiny plus Attention Rollout achieving lower Deletion AUC than mismatched...
- Homogeneous Stellar Parameters from Heterogeneous Spectra with Deep Learning
  A single end-to-end Transformer model unifies stellar labels from heterogeneous spectroscopic surveys into a self-consistent scale without post-hoc recalibration.
- Quantifying Explanation Consistency: The C-Score Metric for CAM-Based Explainability in Medical Image Classification
  The C-Score quantifies intra-class explanation consistency for CAM methods via confidence-weighted pairwise soft IoU and detects AUC-consistency dissociation as an early warning for model instability on chest X-ray cl...
- Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
  Rhamba uses region-aware masking strategies and hybrid Attention-Mamba models pretrained on ABIDE fMRI data to achieve top AUROC on schizophrenia and ADHD classification tasks while outperforming prior methods.
- Compared to What? Baselines and Metrics for Counterfactual Prompting
  Counterfactual prompting effects on LLMs are often indistinguishable from those caused by meaning-preserving paraphrases, causing most previously reported demographic sensitivities to disappear under proper statistica...
- Validating the Clinical Utility of CineECG 3D Reconstructions through Cross-Modal Feature Attribution
  Cross-modal averaging maps ECG model attributions to CineECG 3D space, raising Dice overlap with expert annotations from 0.47 to 0.56 on 20 cases while filtering attribution noise.
- TwinSpecNet: Extending APOGEE's chemical reach to low-S/N spectra via empirical paired learning
  TwinSpecNet uses empirical paired learning on spectral twins to denoise low-S/N APOGEE spectra and predict stellar parameters and abundances with lower scatter than the standard pipeline.
- AI-Generated Images: What Humans and Machines See When They Look at the Same Image
  Researchers train AI detectors on a large photorealistic fake image dataset, apply 16 XAI methods, and use human survey feedback to assess alignment between machine explanations and human perception of AI-generated images.
- Rhamba: Region-Aware Hybrid Attention-Mamba Framework for Self-Supervised Learning in Resting-State fMRI
  Rhamba is a region-aware hybrid Attention-Mamba framework that uses anatomically guided masking for self-supervised pretraining on ABIDE fMRI data and shows competitive AUROC on downstream schizophrenia and ADHD class...
- Uncertainty-Aware Transformers: Conformal Prediction for Language Models
  CONFIDE applies conformal prediction to transformer embeddings for valid prediction sets, improving accuracy by up to 4.09% and efficiency over baselines on models like BERT-tiny.
Discussion (0)