
On the robustness of interpretability methods

David Alvarez-Melis, Tommi S Jaakkola · 2018 · cs.LG · arXiv 1806.08049

Mixed citation behavior. Most common role is background (60%).

15 Pith papers citing it



abstract

We argue that robustness of explanations---i.e., that similar inputs should give rise to similar explanations---is a key desideratum for interpretability. We introduce metrics to quantify robustness and demonstrate that current methods do not perform well according to these metrics. Finally, we propose ways that robustness can be enforced on existing interpretability approaches.
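The desideratum that similar inputs should receive similar explanations can be made concrete with a small numerical check. The sketch below is one possible illustration and is not taken from this page: it samples perturbed inputs in a box around a point and reports the largest observed ratio of explanation change to input change, a local Lipschitz-style estimate. The toy model, the `gradient_explanation` explainer, the function names, and the `eps` and `n_samples` defaults are all assumptions chosen for the example.

```python
import math
import random


def gradient_explanation(x):
    """Hypothetical explainer: gradient of the toy model f(x) = x0**2 + 3*x1."""
    return [2.0 * x[0], 3.0]


def l2(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))


def local_robustness(explain, x, eps=0.1, n_samples=200, seed=0):
    """Largest sampled ratio ||e(x) - e(x')|| / ||x - x'|| over x' in an eps-box around x.

    Lower values mean the explanation changes less under small input changes.
    Random sampling only gives a lower bound on the true worst case.
    """
    rng = random.Random(seed)
    base = explain(x)
    worst = 0.0
    for _ in range(n_samples):
        x_prime = [xi + rng.uniform(-eps, eps) for xi in x]
        dist = l2(x, x_prime)
        if dist == 0.0:
            continue  # skip the degenerate case of an unperturbed sample
        worst = max(worst, l2(base, explain(x_prime)) / dist)
    return worst


score = local_robustness(gradient_explanation, [1.0, -2.0])
constant_score = local_robustness(lambda x: [1.0, 1.0], [0.0, 0.0])
```

For this toy explainer the ratio can never exceed 2, since only the first gradient component depends on the input, and an explainer that returns a constant vector scores exactly 0. A real evaluation would likely replace random sampling with an optimization over the neighborhood.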


citation-role summary

background: 3 · method: 2

citation-polarity summary

background: 3 · use method: 2

representative citing papers

Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions

cs.LG · 2026-05-12 · unverdicted · novelty 7.0

Outcome-fair credit models often exhibit hidden procedural bias through inconsistent reasoning across groups, which the CEC framework mitigates by enforcing consistent feature attributions via counterfactuals.

In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks

cs.LG · 2026-03-16 · unverdicted · novelty 7.0

In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.

Extremal Contours: Gradient-driven contours for compact visual attribution

cs.CV · 2025-11-03 · unverdicted · novelty 7.0

A training-free method using Fourier-parameterized star-convex contours optimized via gradients to generate compact, faithful visual attributions for image classifiers on benchmarks like ImageNet.

Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions?

cs.CV · 2026-04-02 · unverdicted · novelty 7.0

FASS benchmark shows post-hoc attributions remain unstable under geometric perturbations even after filtering for unchanged predictions, with Grad-CAM exhibiting the highest stability across ImageNet, COCO, and CIFAR-10.

A Unified Framework for Evaluating and Enhancing the Transparency of Explainable AI Methods via Perturbation-Gradient Consensus Attribution

cs.AI · 2024-12-05 · unverdicted · novelty 6.0

Introduces a unified evaluation framework for XAI using five principled metrics and the PGCA method that fuses grid perturbation with Grad-CAM++, reporting top scores in fidelity, interpretability, and fairness on ResNet-50 models across five image domains.

Explaining Predictions from Tree-based Boosting Ensembles

cs.LG · 2019-07-04 · unverdicted · novelty 6.0

Develops a method to find minimal input perturbations that flip GBDT predictions by extending random-forest counterfactuals to account for sequential tree dependencies and negative-gradient training.

Interpretability Can Be Actionable

cs.LG · 2026-05-11 · conditional · novelty 6.0

Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.

Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE

cs.CR · 2026-05-07 · unverdicted · novelty 6.0

Static malware classifiers learn packing artifacts and dataset composition biases rather than malicious semantics, as diagnosed by TRUSTEE interpretability across controlled dataset variations.

A renormalization-group inspired lattice-based framework for piecewise generalized linear models

stat.ME · 2026-05-06 · unverdicted · novelty 6.0

RG-inspired lattice models for piecewise GLMs provide explicit interpretable partitions and a replica-analysis-derived scaling law for regularization that allows increasing complexity without expected rise in generalization loss.

NEURON: A Neuro-symbolic System for Grounded Clinical Explainability

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via RAG-LLM.

GESD: Beyond Outcome-Oriented Fairness

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

The paper proposes GESD, a procedural fairness metric for group disparities in explanation stability and robustness, and integrates it into the FEU multi-objective optimization framework.

Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework

cs.LG · 2026-05-14 · unverdicted · novelty 5.0

MIRAI is a unified index that combines five responsibility dimensions into one score for tabular models, demonstrating that predictive performance does not ensure high overall integrity.

ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification

cs.AI · 2025-10-14 · unverdicted · novelty 5.0

ProtoSiTex introduces dual-phase prototype learning with hierarchical consistency loss for semi-interpretable multi-label text classification on a new subsentence-annotated hotel review dataset.

Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs

cs.CL · 2026-04-14 · unverdicted · novelty 5.0

HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions than prior methods.

Architecture-Aware Explanation Auditing for Industrial Visual Inspection

cs.LG · 2026-05-14 · 2 refs

citing papers explorer

Showing 15 of 15 citing papers.

Do Fair Models Reason Fairly? Counterfactual Explanation Consistency for Procedural Fairness in Credit Decisions cs.LG · 2026-05-12 · unverdicted · none · ref 28 · internal anchor
Outcome-fair credit models often exhibit hidden procedural bias through inconsistent reasoning across groups, which the CEC framework mitigates by enforcing consistent feature attributions via counterfactuals.
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks cs.LG · 2026-03-16 · unverdicted · none · ref 3 · internal anchor
In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.
Extremal Contours: Gradient-driven contours for compact visual attribution cs.CV · 2025-11-03 · unverdicted · none · ref 36 · internal anchor
A training-free method using Fourier-parameterized star-convex contours optimized via gradients to generate compact, faithful visual attributions for image classifiers on benchmarks like ImageNet.
Feature Attribution Stability Suite: How Stable Are Post-Hoc Attributions? cs.CV · 2026-04-02 · unverdicted · none · ref 3
FASS benchmark shows post-hoc attributions remain unstable under geometric perturbations even after filtering for unchanged predictions, with Grad-CAM exhibiting the highest stability across ImageNet, COCO, and CIFAR-10.
A Unified Framework for Evaluating and Enhancing the Transparency of Explainable AI Methods via Perturbation-Gradient Consensus Attribution cs.AI · 2024-12-05 · unverdicted · none · ref 2 · internal anchor
Introduces a unified evaluation framework for XAI using five principled metrics and the PGCA method that fuses grid perturbation with Grad-CAM++, reporting top scores in fidelity, interpretability, and fairness on ResNet-50 models across five image domains.
Explaining Predictions from Tree-based Boosting Ensembles cs.LG · 2019-07-04 · unverdicted · none · ref 2 · internal anchor
Develops a method to find minimal input perturbations that flip GBDT predictions by extending random-forest counterfactuals to account for sequential tree dependencies and negative-gradient training.
Interpretability Can Be Actionable cs.LG · 2026-05-11 · conditional · none · ref 143
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
Beyond the Wrapper: Identifying Artifact Reliance in Static Malware Classifiers using TRUSTEE cs.CR · 2026-05-07 · unverdicted · none · ref 9
Static malware classifiers learn packing artifacts and dataset composition biases rather than malicious semantics, as diagnosed by TRUSTEE interpretability across controlled dataset variations.
A renormalization-group inspired lattice-based framework for piecewise generalized linear models stat.ME · 2026-05-06 · unverdicted · none · ref 16
RG-inspired lattice models for piecewise GLMs provide explicit interpretable partitions and a replica-analysis-derived scaling law for regularization that allows increasing complexity without expected rise in generalization loss.
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability cs.AI · 2026-05-02 · unverdicted · none · ref 5
NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via RAG-LLM.
GESD: Beyond Outcome-Oriented Fairness cs.LG · 2026-05-14 · unverdicted · none · ref 7 · internal anchor
The paper proposes GESD, a procedural fairness metric for group disparities in explanation stability and robustness, and integrates it into the FEU multi-objective optimization framework.
Multi-Dimensional Model Integrity and Responsibility Assessment Index and Scoring Framework cs.LG · 2026-05-14 · unverdicted · none · ref 25 · internal anchor
MIRAI is a unified index that combines five responsibility dimensions into one score for tabular models, demonstrating that predictive performance does not ensure high overall integrity.
ProtoSiTex: Learning Semi-Interpretable Prototypes for Multi-label Text Classification cs.AI · 2025-10-14 · unverdicted · none · ref 14 · internal anchor
ProtoSiTex introduces dual-phase prototype learning with hierarchical consistency loss for semi-interpretable multi-label text classification on a new subsentence-annotated hotel review dataset.
Hessian-Enhanced Token Attribution (HETA): Interpreting Autoregressive LLMs cs.CL · 2026-04-14 · unverdicted · none · ref 2
HETA is a new attribution framework for decoder-only LLMs that combines semantic transition vectors, Hessian-based sensitivity scores, and KL divergence to produce more faithful and human-aligned token attributions than prior methods.
Architecture-Aware Explanation Auditing for Industrial Visual Inspection cs.LG · 2026-05-14 · unreviewed · ref 8 · 2 links · internal anchor
