The Mythos of Model Interpretability

Zachary C. Lipton

Authors on Pith no claims yet

classification 💻 cs.LG cs.AIcs.CVcs.NEstat.ML

keywords interpretabilitymodelsinterpretablemodelnotionsdiversemotivationswhat

read the original abstract

Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.

This paper has not been read by Pith yet.

discussion (0)

Forward citations

Cited by 4 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch
cs.AI 2026-05 unverdicted novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces
cs.LG 2026-05 unverdicted novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.
What Physics do Data-Driven MoCap-to-Radar Models Learn?
cs.LG 2026-04 unverdicted novelty 6.0

Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.
Seeing What Shouldn't Be There: Counterfactual GANs for Medical Image Attribution
cs.CV 2026-05 unverdicted novelty 5.0

A cycle-consistent GAN generates counterfactual medical images to attribute classification decisions more comprehensively than standard saliency methods.