Towards A Rigorous Science of Interpretable Machine Learning

Finale Doshi-Velez, Been Kim · 2017 · stat.ML · arXiv 1702.08608

Canonical reference. 71% of citing Pith papers cite this work as background.

71 Pith papers citing it

Background 71% of classified citations

abstract

As machine learning systems become ubiquitous, there has been a surge of interest in interpretable machine learning: systems that provide explanation for their outputs. These explanations are often used to qualitatively assess other criteria such as safety or non-discrimination. However, despite the interest in interpretability, there is very little consensus on what interpretable machine learning is and how it should be measured. In this position paper, we first define interpretability and describe when interpretability is needed (and when it is not). Next, we suggest a taxonomy for rigorous evaluation and expose open questions towards a more rigorous science of interpretable machine learning.

citation-role summary

background 13 · method 1

citation-polarity summary

background 10 · support 3 · use method 1

representative citing papers

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction

cs.LG · 2026-05-03 · unverdicted · novelty 7.0

ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.

In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks

cs.LG · 2026-03-16 · unverdicted · novelty 7.0

In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.

Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild

cs.SE · 2026-01-25 · conditional · novelty 7.0

Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.

Extremal Contours: Gradient-driven contours for compact visual attribution

cs.CV · 2025-11-03 · unverdicted · novelty 7.0

A training-free method using Fourier-parameterized star-convex contours optimized via gradients to generate compact, faithful visual attributions for image classifiers on benchmarks like ImageNet.

Temporal Counterfactual Explanations of Behaviour Tree Decisions

cs.RO · 2025-09-09 · unverdicted · novelty 7.0

A method automatically constructs a causal model from behavior tree structure and domain knowledge to generate real-time causal counterfactual explanations for robot decisions.

Mechanistic Interpretability with Sparse Autoencoder Neural Operators

cs.LG · 2025-09-03 · unverdicted · novelty 7.0

SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.

MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization

cs.CV · 2025-08-11 · unverdicted · novelty 7.0

MIMIC is a new inversion framework that recovers visual concepts from VLM internal states using joint inversion, feature alignment, and three regularizers.

Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting

cs.CL · 2023-05-07 · accept · novelty 7.0

Chain-of-thought explanations in LLMs are frequently unfaithful: models systematically omit mention of biasing prompt features that change their answers and instead produce rationalizations for those biased outputs.

Investigating Concept Alignment Using Implausible Category Members

cs.AI · 2026-05-20 · unverdicted · novelty 6.0

AI models misalign with humans on concept boundaries when probed with implausible category members, such as classifying words as vehicles or vegetables as fruit.

Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites

cs.CV · 2026-05-19 · unverdicted · novelty 6.0

p-ResNet-50 adds a prototype layer with anchor- and medoid-based regularizations to ResNet-50, achieving ROC-AUC 0.994 and accuracy 0.957 on ~12k XCT patches while supplying case-based explanations aligned to expert categories.

Bridging the Disciplinary Gap in Explainable AI: From Abstract Desiderata to Concrete Tasks

cs.CY · 2026-05-19 · unverdicted · novelty 6.0

The authors introduce a taxonomy with target, functional role, and mode of justification axes plus a framework that decomposes abstract XAI desiderata into concrete benchmarkable tasks via identified dependency structures.

Entropy-Based Characterisation of the Polarised Regime in Latent Variable Models

cs.LG · 2026-05-15 · unverdicted · novelty 6.0

An entropy criterion on mean representations characterises the polarised regime in VAEs and related models, with theoretical links to KL minimisation and empirical tests across several architectures.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

Interpretability Can Be Actionable

cs.LG · 2026-05-11 · conditional · novelty 6.0

Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.

The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime

cs.AI · 2026-05-11 · unverdicted · novelty 6.0

AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.

ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding

cs.LG · 2026-05-08 · unverdicted · novelty 6.0

ShifaMind achieves competitive performance with the LAAT baseline on MIMIC-IV top-50 ICD-10 coding while outperforming vanilla concept bottleneck models and providing concept-mediated explanations.

Evaluation Cards for XAI Metrics

cs.CV · 2026-05-06 · unverdicted · novelty 6.0

The authors introduce the XAI Evaluation Card template to standardize how XAI evaluation metrics are defined, validated, and reported.

NEURON: A Neuro-symbolic System for Grounded Clinical Explainability

cs.AI · 2026-05-02 · unverdicted · novelty 6.0

NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via RAG-LLM.

Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings

cs.LG · 2026-04-24 · unverdicted · novelty 6.0

In high-stakes settings, Shapley explanations increase analyst confidence but do not improve decision accuracy, and standard metrics fail to predict human utility.

Design Guidelines for Game-Based Refresher Training of Community Health Workers in Low-Resource Contexts

cs.HC · 2026-04-06 · unverdicted · novelty 6.0

A four-year mixed-methods study of game-based systems for Indian CHWs yields eight design guidelines for sustained engagement, learning transfer, and contextual appropriateness in low-resource health training.

X-SYS: A Reference Architecture for Interactive Explanation Systems

cs.AI · 2026-02-13 · unverdicted · novelty 6.0

X-SYS is a reference architecture for interactive explanation systems organized around STAR quality attributes and five service components, demonstrated via SemanticLens for vision-language models.

Faster Verified Explanations for Neural Networks

cs.LG · 2025-11-28 · unverdicted · novelty 6.0

FaVeX accelerates verified explanations for neural networks via dynamic batch-sequential processing and query reuse while introducing verifier-optimal robust explanations that incorporate verifier incompleteness.

Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach

econ.EM · 2025-11-03 · unverdicted · novelty 6.0

A new framework combines AI-derived concept embeddings with high-dimensional selective inference to enable statistically principled, interpretable discovery from unstructured data in empirical economics.

citing papers explorer

Showing 50 of 71 citing papers.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch cs.AI · 2026-05-05 · unverdicted · none · ref 20 · internal anchor
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
ISAAC: Auditing Causal Reasoning in Deep Models for Drug-Target Interaction cs.LG · 2026-05-03 · unverdicted · none · ref 15 · internal anchor
ISAAC auditing applied to three DTI models on the Davis benchmark finds 25% relative differences in causal reasoning scores despite nearly identical AUROC values.
In-Context Symbolic Regression for Robustness-Improved Kolmogorov-Arnold Networks cs.LG · 2026-03-16 · unverdicted · none · ref 8 · internal anchor
In-context symbolic regression methods improve robustness of symbolic formula recovery from KANs, cutting median OFAT test MSE by up to 99.8 percent across hyperparameter sweeps.
Results-Actionability Gap: Understanding How Practitioners Evaluate LLM Products in the Wild cs.SE · 2026-01-25 · conditional · none · ref 16 · internal anchor
Qualitative study of 19 practitioners reveals ten LLM product evaluation practices and introduces the results-actionability gap as a key barrier to turning findings into improvements.
Extremal Contours: Gradient-driven contours for compact visual attribution cs.CV · 2025-11-03 · unverdicted · none · ref 8 · internal anchor
A training-free method using Fourier-parameterized star-convex contours optimized via gradients to generate compact, faithful visual attributions for image classifiers on benchmarks like ImageNet.
Temporal Counterfactual Explanations of Behaviour Tree Decisions cs.RO · 2025-09-09 · unverdicted · none · ref 9 · internal anchor
A method automatically constructs a causal model from behavior tree structure and domain knowledge to generate real-time causal counterfactual explanations for robot decisions.
Mechanistic Interpretability with Sparse Autoencoder Neural Operators cs.LG · 2025-09-03 · unverdicted · none · ref 72 · internal anchor
SAE-NOs extend sparse autoencoders to function spaces via Fourier neural operators with concept and domain sparsity, learning localized patterns more efficiently and generalizing across discretizations on vision data.
MIMIC: Multimodal Inversion for Model Interpretation and Conceptualization cs.CV · 2025-08-11 · unverdicted · none · ref 6 · internal anchor
MIMIC is a new inversion framework that recovers visual concepts from VLM internal states using joint inversion, feature alignment, and three regularizers.
Language Models Don't Always Say What They Think: Unfaithful Explanations in Chain-of-Thought Prompting cs.CL · 2023-05-07 · accept · none · ref 1 · internal anchor
Chain-of-thought explanations in LLMs are frequently unfaithful: models systematically omit mention of biasing prompt features that change their answers and instead produce rationalizations for those biased outputs.
Investigating Concept Alignment Using Implausible Category Members cs.AI · 2026-05-20 · unverdicted · none · ref 10 · internal anchor
AI models misalign with humans on concept boundaries when probed with implausible category members, such as classifying words as vehicles or vegetables as fruit.
Interpretable Computer Vision for Defect Detection in X-ray Tomography of Aerospace SiC/SiC Composites cs.CV · 2026-05-19 · unverdicted · none · ref 20 · internal anchor
p-ResNet-50 adds a prototype layer with anchor- and medoid-based regularizations to ResNet-50, achieving ROC-AUC 0.994 and accuracy 0.957 on ~12k XCT patches while supplying case-based explanations aligned to expert categories.
Bridging the Disciplinary Gap in Explainable AI: From Abstract Desiderata to Concrete Tasks cs.CY · 2026-05-19 · unverdicted · none · ref 27 · internal anchor
The authors introduce a taxonomy with target, functional role, and mode of justification axes plus a framework that decomposes abstract XAI desiderata into concrete benchmarkable tasks via identified dependency structures.
Entropy-Based Characterisation of the Polarised Regime in Latent Variable Models cs.LG · 2026-05-15 · unverdicted · none · ref 58 · internal anchor
An entropy criterion on mean representations characterises the polarised regime in VAEs and related models, with theoretical links to KL minimisation and empirical tests across several architectures.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 175 · internal anchor
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
Interpretability Can Be Actionable cs.LG · 2026-05-11 · conditional · none · ref 42 · internal anchor
Interpretability research should be judged by actionability—the degree to which its insights support concrete decisions and interventions—rather than explanatory power alone.
The Open-Box Fallacy: Why AI Deployment Needs a Calibrated Verification Regime cs.AI · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
AI deployment in high-stakes areas requires domain-scoped calibrated verification with monitoring and revocation, using a proposed six-component Verification Coverage standard instead of mechanistic interpretability.
ShifaMind: A Multiplicative Concept Bottleneck for Interpretable ICD-10 Coding cs.LG · 2026-05-08 · unverdicted · none · ref 4 · internal anchor
ShifaMind achieves competitive performance with the LAAT baseline on MIMIC-IV top-50 ICD-10 coding while outperforming vanilla concept bottleneck models and providing concept-mediated explanations.
Evaluation Cards for XAI Metrics cs.CV · 2026-05-06 · unverdicted · none · ref 4 · internal anchor
The authors introduce the XAI Evaluation Card template to standardize how XAI evaluation metrics are defined, validated, and reported.
NEURON: A Neuro-symbolic System for Grounded Clinical Explainability cs.AI · 2026-05-02 · unverdicted · none · ref 27 · internal anchor
NEURON raises AUC from 0.74-0.77 to 0.84-0.88 on MIMIC-IV heart-failure mortality prediction while lifting human-aligned explanation scores from 0.50 to 0.85 by grounding SHAP values in SNOMED CT and patient notes via RAG-LLM.
Rethinking XAI Evaluation: A Human-Centered Audit of Shapley Benchmarks in High-Stakes Settings cs.LG · 2026-04-24 · unverdicted · none · ref 20 · internal anchor
In high-stakes settings, Shapley explanations increase analyst confidence but do not improve decision accuracy, and standard metrics fail to predict human utility.
Design Guidelines for Game-Based Refresher Training of Community Health Workers in Low-Resource Contexts cs.HC · 2026-04-06 · unverdicted · none · ref 9 · internal anchor
A four-year mixed-methods study of game-based systems for Indian CHWs yields eight design guidelines for sustained engagement, learning transfer, and contextual appropriateness in low-resource health training.
X-SYS: A Reference Architecture for Interactive Explanation Systems cs.AI · 2026-02-13 · unverdicted · none · ref 28 · internal anchor
X-SYS is a reference architecture for interactive explanation systems organized around STAR quality attributes and five service components, demonstrated via SemanticLens for vision-language models.
Faster Verified Explanations for Neural Networks cs.LG · 2025-11-28 · unverdicted · none · ref 6 · internal anchor
FaVeX accelerates verified explanations for neural networks via dynamic batch-sequential processing and query reuse while introducing verifier-optimal robust explanations that incorporate verifier incompleteness.
Making Interpretable Discoveries from Unstructured Data: A High-Dimensional Multiple Hypothesis Testing Approach econ.EM · 2025-11-03 · unverdicted · none · ref 34 · internal anchor
A new framework combines AI-derived concept embeddings with high-dimensional selective inference to enable statistically principled, interpretable discovery from unstructured data in empirical economics.
On the definition and importance of interpretability in scientific machine learning cs.LG · 2025-05-16 · conditional · none · ref 20 · internal anchor
Interpretability in SciML requires mechanistic understanding rather than sparsity, and prior knowledge is often essential for interpretable scientific discovery.
Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation cs.SE · 2025-03-21 · unverdicted · none · ref 20 · internal anchor
CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-person study.
A Unified Framework for Evaluating and Enhancing the Transparency of Explainable AI Methods via Perturbation-Gradient Consensus Attribution cs.AI · 2024-12-05 · unverdicted · none · ref 5 · internal anchor
Introduces a unified evaluation framework for XAI using five principled metrics and the PGCA method that fuses grid perturbation with Grad-CAM++, reporting top scores in fidelity, interpretability, and fairness on ResNet-50 models across five image domains.
Ethical and social risks of harm from Language Models cs.CL · 2021-12-08 · accept · none · ref 70 · internal anchor
The authors provide a detailed taxonomy of 21 risks associated with language models, covering discrimination, information leaks, misinformation, malicious applications, interaction harms, and societal impacts like job loss and environmental costs.
Interpretable and Steerable Sequence Learning via Prototypes cs.LG · 2019-07-23 · unverdicted · none · ref 11 · internal anchor
ProSeNet learns a sparse set of prototypes for case-based explanations in deep sequence models, matches state-of-the-art accuracy on several tasks, and supports manual prototype refinement by non-experts.
The Price of Interpretability cs.LG · 2019-07-08 · unverdicted · none · ref 14 · internal anchor
Introduces a framework for constructing ML models via interpretable steps, generalizes standard proxies into a parametrized family of measures, and quantifies the accuracy-interpretability tradeoff via practical algorithms.
SINAPSE: A lightweight deep learning framework for accurate and explainable neutron-$\gamma$ discrimination physics.ins-det · 2026-05-13 · unverdicted · none · ref 24 · internal anchor
SINAPSE uses a dual-branch neural network with a 1D convolutional autoencoder for denoising and a classifier for neutron-gamma discrimination, trained via random augmentations on high-SNR data and validated with SHAP explanations.
Evaluating the False Trust Engendered by LLM Explanations cs.HC · 2026-05-11 · unverdicted · none · ref 6 · 2 links · internal anchor
LLM reasoning traces and post-hoc explanations increase false trust in incorrect predictions, whereas contrastive dual explanations enhance users' ability to distinguish correct from incorrect AI outputs.
NeuroViz: Real-time Interactive Visualization of Forward and Backward Passes in Neural Network Training cs.LG · 2026-05-03 · unverdicted · none · ref 10 · internal anchor
NeuroViz offers interactive real-time visualization of neural network forward and backward passes, achieving top usability scores in a study with 31 participants compared to existing tools.
CoAX: Cognitive-Oriented Attribution eXplanation User Model of Human Understanding of AI Explanations cs.AI · 2026-04-30 · unverdicted · none · ref 28 · internal anchor
Cognitive models of user reasoning strategies with XAI methods on tabular data fit human forward-simulation decisions better than ML baselines and support hypothesis testing without new user studies.
X-NegoBox: An Explainable Privacy-Budget Negotiation Framework for Secure Peer-to-Peer Energy Data Exchange cs.CR · 2026-04-27 · unverdicted · none · ref 21 · internal anchor
X-NegoBox is a proposed explainable framework that negotiates privacy budgets for energy data exchange using trust, sensitivity, and purpose factors, with experiments claiming reduced leakage and higher acceptance rates.
From Awareness to Intent: Mitigating Silent Driving System Failures through Prospective Situation Awareness Enhancing Interfaces cs.HC · 2026-04-20 · unverdicted · none · ref 31 · internal anchor
Prospective situation awareness enhancing interfaces delivered via AR HUD improve takeover performance after silent automation failures, with perceptual cues most effective at raising situational awareness and system-intent messages best at building trust.
Domain-Specialized Object Detection via Model-Level Mixtures of Experts cs.CV · 2026-04-20 · unverdicted · none · ref 12 · internal anchor
Model-level MoE of domain-specialized YOLO detectors with gating network outperforms standard ensembles on BDD100K while revealing expert specialization.
Interpretable and Explainable Surrogate Modeling for Simulations: A State-of-the-Art Survey and Perspectives on Explainable AI for Decision-Making cs.AI · 2026-04-15 · unverdicted · none · ref 147 · internal anchor
This survey synthesizes XAI methods with surrogate modeling workflows for simulations and outlines a research agenda to embed explainability into simulation-driven design and decision-making.
Governed Reasoning for Institutional AI cs.AI · 2026-04-12 · unverdicted · none · ref 3 · internal anchor
Cognitive Core uses nine typed cognitive primitives, a four-tier governance model with human review as an execution condition, and an endogenous audit ledger to reach 91% accuracy with zero silent errors on prior authorization appeals, outperforming ReAct and Plan-and-Solve baselines.
Explainability and Certification of AI-Generated Educational Assessments cs.CY · 2026-03-18 · unverdicted · none · ref 19 · internal anchor
A framework using self-rationalization, attribution analysis, and a certification metadata schema with traffic-light workflow enables transparent, audit-ready AI-generated educational assessments aligned to Bloom's and SOLO taxonomies.
Agentic AI for Cybersecurity: A Meta-Cognitive Architecture for Governable Autonomy cs.CR · 2026-02-12 · unverdicted · none · ref 18 · internal anchor
A meta-cognitive agentic framework coordinates specialized cybersecurity agents through a judgment mechanism to improve decision quality under uncertainty and noise on standard benchmarks.
A Neuro-Symbolic Framework for Accountability in Public-Sector AI cs.CY · 2025-12-13 · unverdicted · none · ref 100 · internal anchor
A framework combining legal ontology, rule extraction, and solver reasoning verifies whether AI explanations for CalFresh eligibility align with statutory constraints.
Why Johnny Can't Use Agents: Industry Aspirations vs. User Realities with AI Agents cs.HC · 2025-09-18 · unverdicted · none · ref 29 · internal anchor
Industry markets AI agents for orchestration, creation, and insight, but a usability study with 31 participants reveals users face challenges from capability misalignment and lack of meta-cognition in tools like Operator and Manus.
Detection of Real-world Driving-induced Affective State Using Physiological Signals and Multi-view Multi-task Machine Learning cs.LG · 2019-07-19 · unverdicted · none · ref 12 · internal anchor
A multi-view multi-task ML method detects real-world driving-induced affective states using physiological signals by modeling inter-drive variability, with results showing performance gains on three datasets.
Optimal Explanations of Linear Models cs.LG · 2019-07-08 · unverdicted · none · ref 28 · internal anchor
An optimization framework decomposes linear models into increasing-complexity sequences using coordinate updates to generate parametrized interpretability metrics.
A Human-Grounded Evaluation of SHAP for Alert Processing cs.LG · 2019-07-07 · unverdicted · none · ref 1 · internal anchor
Human-grounded evaluation finds no significant performance improvement from adding SHAP explanations to model confidence scores in alert processing.
Do Transformer Attention Heads Provide Transparency in Abstractive Summarization? cs.CL · 2019-07-01 · unverdicted · none · ref 5 · internal anchor
Analysis of transformer attention heads in abstractive summarization shows specialization in some heads and proposes a method to measure model reliance on learned attention distributions.
Interpretable Question Answering on Knowledge Bases and Text cs.CL · 2019-06-26 · unverdicted · none · ref 7 · internal anchor
Compares LIME, input perturbation and attention for explaining QA on KB+text; proposes automatic evaluation paradigm and finds input perturbation superior in both automatic and human studies.
Artificial Adaptive Intelligence: The Missing Stage Between Narrow and General Intelligence cs.AI · 2026-05-16 · unverdicted · none · ref 6 · internal anchor
Proposes Artificial Adaptive Intelligence as the regime between narrow and general AI, defined by elimination of human-specified hyperparameters, and introduces an adaptivity index plus parametric minimality principle grounded in minimum description length.
Explanation-Aware Learning for Enhanced Interpretability in Biomedical Imaging cs.CV · 2026-05-11 · unverdicted · none · ref 10 · internal anchor
Adding explanation supervision to training improves spatial alignment of saliency maps with clinical annotations on chest X-rays while keeping predictive accuracy comparable.
