The Mythos of Model Interpretability

Zachary C. Lipton

arxiv: 1606.03490 · v3 · pith:PGC2VZJXnew · submitted 2016-06-10 · 💻 cs.LG · cs.AI · cs.CV · cs.NE · stat.ML

classification 💻 cs.LG · cs.AI · cs.CV · cs.NE · stat.ML

keywords interpretability · models · interpretable · model · notions · diverse · motivations · what

Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.

Forward citations

Cited by 20 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

The Attribution Contract: Feature Attribution for Generative Language Models
cs.LG 2026-05 unverdicted novelty 7.0

Introduces the Attribution Contract specification to clarify feature attribution claims in generative language models by naming the output explained, eligible features, generative process, fixed elements, and attribut...
Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Rashomon Sets and Model Multiplicity in Federated Learning
cs.LG 2026-02 unverdicted novelty 7.0

The work provides the first formal definitions of Rashomon sets for federated learning and introduces a multiplicity-aware training pipeline evaluated on standard benchmarks.
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima
cs.LG 2025-12 unverdicted novelty 7.0

A piecewise biconvex optimization framework unifies sparse dictionary learning variants, explains their pathologies via spurious optima, and enables feature anchoring to restore identifiability.
The Attribution Contract: Feature Attribution for Generative Language Models
cs.LG 2026-05 unverdicted novelty 6.0

The paper proposes the Attribution Contract as a framework to resolve conceptual ambiguities in applying feature attribution to autoregressive and diffusion language models by explicitly specifying what is being explained.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
What Physics do Data-Driven MoCap-to-Radar Models Learn?
cs.LG 2026-04 unverdicted novelty 6.0

Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.
On the definition and importance of interpretability in scientific machine learning
cs.LG 2025-05 conditional novelty 6.0

Interpretability in SciML requires mechanistic understanding rather than sparsity, and prior knowledge is often essential for interpretable scientific discovery.
Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation
cs.SE 2025-03 unverdicted novelty 6.0

CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-p...
The Price of Interpretability
cs.LG 2019-07 unverdicted novelty 6.0

Introduces a framework for constructing ML models via interpretable steps, generalizes standard proxies into a parametrized family of measures, and quantifies the accuracy-interpretability tradeoff via practical algorithms.
Seeing What Shouldn't Be There: Counterfactual GANs for Medical Image Attribution
cs.CV 2026-05 unverdicted novelty 5.0

A cycle-consistent GAN generates counterfactual medical images to attribute classification decisions more comprehensively than standard saliency methods.
Visual Interaction with Deep Learning Models through Collaborative Semantic Inference
cs.HC 2019-07 unverdicted novelty 5.0

Proposes the CSI framework for co-designing visual interactions and deep learning models to expose and allow semantic control over intermediate reasoning processes, shown in a summarization case study.
Optimal Explanations of Linear Models
cs.LG 2019-07 unverdicted novelty 5.0

An optimization framework decomposes linear models into increasing-complexity sequences using coordinate updates to generate parametrized interpretability metrics.
A Human-Grounded Evaluation of SHAP for Alert Processing
cs.LG 2019-07 unverdicted novelty 5.0

Human-grounded evaluation finds no significant performance improvement from adding SHAP explanations to model confidence scores in alert processing.
Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions
cs.CY 2026-02 unverdicted novelty 4.0

Current XAI methods for DNNs and LLMs rest on paradoxes and false assumptions that demand a paradigm shift to verification protocols, scientific foundations, context-aware design, and faithful model analysis rather th...
Machine Learning Approaches for Improved Scalability of Metallic Magnetic Calorimeters
physics.ins-det 2026-06 unverdicted novelty 3.0

Machine learning methods are explored for pulse classification, artifact rejection, and shape analysis in metallic magnetic calorimeters to improve scalability over traditional signal processing.
Reframing AI Loss of Control: What It Is, How to Have It, How to Lose It
cs.CY 2026-05 unverdicted novelty 3.0

A conceptual framework reframes AI loss of control by anchoring the definition of control to goal setting and alignment, arguing that such loss can occur with existing AI systems.
Generating Counterfactual and Contrastive Explanations using SHAP
cs.LG 2019-06 unverdicted novelty 3.0

Model-agnostic SHAP-based pipeline for contrastive explanations and counterfactual datapoints, evaluated on IRIS, Wine Quality, and Mobile Features datasets.
Unexplainability and Incomprehensibility of Artificial Intelligence
cs.CY 2019-06 unverdicted novelty 3.0

Advanced AI systems are unexplainable in full and produce explanations that humans cannot comprehend.
On the Semantic Interpretability of Artificial Intelligence Models
cs.AI 2019-07 unverdicted novelty 2.0

This survey classifies semantic interpretability methods in AI models by nature and feature introduction, reviews user impact, and identifies remaining gaps.