The Mythos of Model Interpretability
Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.
Forward citations
Cited by 20 Pith papers
-
The Attribution Contract: Feature Attribution for Generative Language Models
Introduces the Attribution Contract specification to clarify feature attribution claims in generative language models by naming the output explained, eligible features, generative process, fixed elements, and attribut...
-
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
-
Rashomon Sets and Model Multiplicity in Federated Learning
The work provides the first formal definitions of Rashomon sets for federated learning and introduces a multiplicity-aware training pipeline evaluated on standard benchmarks.
-
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
A piecewise biconvex optimization framework unifies sparse dictionary learning variants, explains their pathologies via spurious optima, and enables feature anchoring to restore identifiability.
-
The Attribution Contract: Feature Attribution for Generative Language Models
The paper proposes the Attribution Contract as a framework to resolve conceptual ambiguities in applying feature attribution to autoregressive and diffusion language models by explicitly specifying what is being explained.
-
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
-
What Physics do Data-Driven MoCap-to-Radar Models Learn?
Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.
-
On the definition and importance of interpretability in scientific machine learning
Interpretability in SciML requires mechanistic understanding rather than sparsity, and prior knowledge is often essential for interpretable scientific discovery.
-
Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation
CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-p...
-
The Price of Interpretability
Introduces a framework for constructing ML models via interpretable steps, generalizes standard proxies into a parametrized family of measures, and quantifies the accuracy-interpretability tradeoff via practical algorithms.
-
Seeing What Shouldn't Be There: Counterfactual GANs for Medical Image Attribution
A cycle-consistent GAN generates counterfactual medical images to attribute classification decisions more comprehensively than standard saliency methods.
-
Visual Interaction with Deep Learning Models through Collaborative Semantic Inference
Proposes the CSI framework for co-designing visual interactions and deep learning models to expose and allow semantic control over intermediate reasoning processes, shown in a summarization case study.
-
Optimal Explanations of Linear Models
An optimization framework decomposes linear models into increasing-complexity sequences using coordinate updates to generate parametrized interpretability metrics.
-
A Human-Grounded Evaluation of SHAP for Alert Processing
Human-grounded evaluation finds no significant performance improvement from adding SHAP explanations to model confidence scores in alert processing.
-
Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions
Current XAI methods for DNNs and LLMs rest on paradoxes and false assumptions that demand a paradigm shift to verification protocols, scientific foundations, context-aware design, and faithful model analysis rather th...
-
Machine Learning Approaches for Improved Scalability of Metallic Magnetic Calorimeters
Machine learning methods are explored for pulse classification, artifact rejection, and shape analysis in metallic magnetic calorimeters to improve scalability over traditional signal processing.
-
Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It
A conceptual framework reframes AI loss of control by anchoring the definition of control to goal setting and alignment, arguing that such loss can occur with existing AI systems.
-
Generating Counterfactual and Contrastive Explanations using SHAP
Model-agnostic SHAP-based pipeline for contrastive explanations and counterfactual datapoints, evaluated on IRIS, Wine Quality, and Mobile Features datasets.
-
Unexplainability and Incomprehensibility of Artificial Intelligence
Advanced AI systems are unexplainable in full and produce explanations that humans cannot comprehend.
-
On the Semantic Interpretability of Artificial Intelligence Models
This survey classifies semantic interpretability methods in AI models by nature and feature introduction, reviews user impact, and identifies remaining gaps.