On the Robustness of Interpretability Methods

David Alvarez-Melis; Tommi S. Jaakkola

arXiv: 1806.08049 · v1 · pith:G7RGBTGNnew · submitted 2018-06-21 · cs.LG · stat.ML

keywords: robustness, interpretability, methods, metrics, similar, according, approaches, argue

We argue that robustness of explanations (i.e., that similar inputs should give rise to similar explanations) is a key desideratum for interpretability. We introduce metrics to quantify robustness and demonstrate that current methods do not perform well according to these metrics. Finally, we propose ways that robustness can be enforced on existing interpretability approaches.
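
As a minimal illustration of the robustness notion stated in the abstract, and not the paper's own implementation, the sketch below estimates how much an explanation changes relative to a small change in the input. The function name `local_robustness_estimate`, the `explain` callable, the radius `eps`, and the random-sampling search are all assumptions made for this example; the abstract itself does not specify the metrics.

```python
import numpy as np


def local_robustness_estimate(explain, x, eps=0.1, n_samples=200, seed=0):
    """Sampled estimate of how fast an explanation changes near x.

    Returns the largest observed ratio
        ||explain(x) - explain(x')||_2 / ||x - x'||_2
    over random points x' drawn from an L-infinity ball of radius eps.
    Hypothetical helper: random sampling only gives a lower bound on the
    true worst case inside the ball.
    """
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    e_x = np.asarray(explain(x), dtype=float)

    worst = 0.0
    for _ in range(n_samples):
        # Perturb each input coordinate by at most eps.
        delta = rng.uniform(-eps, eps, size=x.shape)
        dist = np.linalg.norm(delta)
        if dist == 0.0:
            continue  # skip the degenerate case x' == x
        e_p = np.asarray(explain(x + delta), dtype=float)
        worst = max(worst, np.linalg.norm(e_x - e_p) / dist)
    return worst


if __name__ == "__main__":
    # Toy "explanation": the gradient of f(x) = sum(x**2), which is 2 * x.
    print(local_robustness_estimate(lambda v: 2.0 * v, np.ones(3)))
```

Under these assumptions, an explainer whose output does not depend on the input scores 0, and the toy gradient explainer above scores about 2 (its exact Lipschitz constant). Lower values indicate explanations that are more stable under small input changes.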

Forward citations

Cited by 16 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Architecture-Aware Explanation Auditing for Industrial Visual Inspection
cs.LG 2026-05 conditional novelty 7.0

Explanation faithfulness for deep classifiers on wafer maps is highest when the explainer matches the model's native readout structure, with ViT-Tiny plus Attention Rollout achieving lower Deletion AUC than mismatched...
Architecture-Aware Explanation Auditing for Industrial Visual Inspection
cs.LG 2026-05 unverdicted novelty 7.0

An audit protocol on wafer maps finds that ViT-Tiny with Attention Rollout achieves better deletion faithfulness than other models and explainers, with readout structure as the key factor and RISE outperforming native...
Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions
cs.LG 2026-05 unverdicted novelty 7.0

Outcome-fair credit models often exhibit hidden procedural bias through inconsistent reasoning across groups, which the CEC framework mitigates by enforcing consistent feature attributions via counterfactuals.
Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?
cs.CV 2026-04 unverdicted novelty 7.0

FASS benchmark shows post-hoc attributions remain unstable under geometric perturbations even after filtering for unchanged predictions, with Grad-CAM exhibiting the highest stability across ImageNet, COCO, and CIFAR-10.
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks
cs.LG 2026-03 unverdicted novelty 7.0

In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.
Extremal Contours: Gradient-driven contours for compact visual attribution
cs.CV 2025-11 unverdicted novelty 7.0

A training-free method using Fourier-parameterized star-convex contours optimized via gradients to generate compact, faithful visual attributions for image classifiers on benchmarks like ImageNet.
Interpretability Can Be Actionable
cs.LG 2026-05 conditional novelty 6.0

Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE
cs.CR 2026-05 unverdicted novelty 6.0

Static malware classifiers learn packing artifacts and dataset composition biases rather than malicious semantics, as diagnosed by TRUSTEE interpretability across controlled dataset variations.
A renormalization-group inspired lattice-based framework for piecewise generalized linear models
stat.ME 2026-05 unverdicted novelty 6.0

RG-inspired lattice models for piecewise GLMs provide explicit interpretable partitions and a replica-analysis-derived scaling law for regularization that allows increasing complexity without expected rise in generali...
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability
cs.AI 2026-05 unverdicted novelty 6.0

NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via...
A Unified Framework for Evaluating and Enhancing the Transparency of Explainable AI Methods via Perturbation-Gradient Consensus Attribution
cs.AI 2024-12 unverdicted novelty 6.0

Introduces a unified evaluation framework for XAI using five principled metrics and the PGCA method that fuses grid perturbation with Grad-CAM++, reporting top scores in fidelity, interpretability and fairness on Res...
Explaining Predictions from Tree-based Boosting Ensembles
cs.LG 2019-07 unverdicted novelty 6.0

Develops a method to find minimal input perturbations that flip GBDT predictions by extending random-forest counterfactuals to account for sequential tree dependencies and negative-gradient training.
GESD: Beyond Outcome-Oriented Fairness
cs.LG 2026-05 unverdicted novelty 5.0

The paper proposes GESD, a procedural fairness metric for group disparities in explanation stability and robustness, and integrates it into the FEU multi-objective optimization framework.
Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework
cs.LG 2026-05 unverdicted novelty 5.0

MIRAI is a unified index that combines five responsibility dimensions into one score for tabular models, demonstrating that predictive performance does not ensure high overall integrity.
Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs
cs.CL 2026-04 unverdicted novelty 5.0

HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions th...
ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification
cs.AI 2025-10 unverdicted novelty 5.0

ProtoSiTex introduces dual-phase prototype learning with hierarchical consistency loss for semi-interpretable multi-label text classification on a new subsentence-annotated hotel review dataset.