The Mythos of Model Interpretability

Zachary C Lipton · 2016 · cs.LG · DOI 10.15587/1729-4061.2016.79356 · arXiv 1606.03490

20 Pith papers cite this work. Polarity classification is still indexing.

abstract

Supervised machine learning models boast remarkable predictive capabilities. But can you trust your model? Will it work in deployment? What else can it tell you about the world? We want models to be not only good, but interpretable. And yet the task of interpretation appears underspecified. Papers provide diverse and sometimes non-overlapping motivations for interpretability, and offer myriad notions of what attributes render models interpretable. Despite this ambiguity, many papers proclaim interpretability axiomatically, absent further explanation. In this paper, we seek to refine the discourse on interpretability. First, we examine the motivations underlying interest in interpretability, finding them to be diverse and occasionally discordant. Then, we address model properties and techniques thought to confer interpretability, identifying transparency to humans and post-hoc explanations as competing notions. Throughout, we discuss the feasibility and desirability of different notions, and question the oft-made assertions that linear models are interpretable and that deep neural networks are not.

citation-role summary

background: 2

citation-polarity summary

background: 1 · support: 1

representative citing papers

Rashomon Sets and Model Multiplicity in Federated Learning

cs.LG · 2026-02-10 · unverdicted · novelty 7.0

The work provides the first formal definitions of Rashomon sets for federated learning and introduces a multiplicity-aware training pipeline evaluated on standard benchmarks.

A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima

cs.LG · 2025-12-05 · unverdicted · novelty 7.0

A piecewise biconvex optimization framework unifies sparse dictionary learning variants, explains their pathologies via spurious optima, and enables feature anchoring to restore identifiability.

Agentic-imodels: Evolving agentic interpretability tools via autoresearch

cs.AI · 2026-05-05 · unverdicted · novelty 7.0

Agentic-imodels evolves scikit-learn regressors via an autoresearch loop to jointly boost predictive performance and LLM-simulatability, improving downstream agentic data science tasks by up to 73% on the BLADE benchmark.

Interpretable Human-Label-Free Deep Learning for Real-Bogus Classification with Uncertainty Quantification

astro-ph.IM · 2026-07-06 · conditional · novelty 6.0

A dual-network asymmetric co-teaching framework trained on injected transients and contaminated survey data achieves human-label-free real-bogus classification with calibrated uncertainty.

Machine Learning Approaches for Improved Scalability of Metallic Magnetic Calorimeters

physics.ins-det · 2026-06-23 · conditional · novelty 6.0

The paper demonstrates two ML pipelines — clustering-based artifact rejection and a CVAE-plus-regressor trained on simulations — that classify MMC pulses and extract energy spectra comparable to FIR filtering.

The Attribution Contract: Feature Attribution for Generative Language Models

cs.LG · 2026-05-21 · unverdicted · novelty 6.0

The paper proposes the Attribution Contract as a framework to resolve conceptual ambiguities in applying feature attribution to autoregressive and diffusion language models by explicitly specifying what is being explained.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

On the definition and importance of interpretability in scientific machine learning

cs.LG · 2025-05-16 · conditional · novelty 6.0

Interpretability in SciML requires mechanistic understanding rather than sparsity, and prior knowledge is often essential for interpretable scientific discovery.

Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation

cs.SE · 2025-03-21 · unverdicted · novelty 6.0

CodeQ aggregates token rationales into code categories to enable global interpretability of LLMs, claiming over 50% entropy reduction and revealing model preference for syntactic cues plus human misalignment in a 37-person study.

The Price of Interpretability

cs.LG · 2019-07-08 · unverdicted · novelty 6.0

Introduces a framework for constructing ML models via interpretable steps, generalizes standard proxies into a parametrized family of measures, and quantifies the accuracy-interpretability tradeoff via practical algorithms.

What Physics do Data-Driven MoCap-to-Radar Models Learn?

cs.LG · 2026-04-19 · unverdicted · novelty 6.0

Data-driven MoCap-to-radar models often fail to learn underlying physics despite low reconstruction error, with temporal attention proving critical for transformers to achieve physical consistency.

Reframing AI Loss of Control: What Control Is, How to Have It, How to Lose It

cs.CY · 2026-05-19 · conditional · novelty 5.0

Control is defined as setting plausible goals and reliably achieving them; on this definition, ordinary AI can already erode human control without superintelligence or takeover.

Visual Interaction with Deep Learning Models through Collaborative Semantic Inference

cs.HC · 2019-07-24 · unverdicted · novelty 5.0

Proposes the CSI framework for co-designing visual interactions and deep learning models to expose and allow semantic control over intermediate reasoning processes, shown in a summarization case study.

Optimal Explanations of Linear Models

cs.LG · 2019-07-08 · unverdicted · novelty 5.0

An optimization framework decomposes linear models into increasing-complexity sequences using coordinate updates to generate parametrized interpretability metrics.

A Human-Grounded Evaluation of SHAP for Alert Processing

cs.LG · 2019-07-07 · unverdicted · novelty 5.0

Human-grounded evaluation finds no significant performance improvement from adding SHAP explanations to model confidence scores in alert processing.

Seeing What Shouldn't Be There: Counterfactual GANs for Medical Image Attribution

cs.CV · 2026-05-06 · unverdicted · novelty 5.0

A cycle-consistent GAN generates counterfactual medical images to attribute classification decisions more comprehensively than standard saliency methods.

Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions

cs.CY · 2026-02-27 · reject · novelty 4.0

A position paper argues that post-hoc XAI explanations are unfaithful and paradoxical, proposing a shift to expert-based verification and certification of AI systems.

Generating Counterfactual and Contrastive Explanations using SHAP

cs.LG · 2019-06-21 · unverdicted · novelty 3.0

Model-agnostic SHAP-based pipeline for contrastive explanations and counterfactual datapoints, evaluated on IRIS, Wine Quality, and Mobile Features datasets.

Unexplainability and Incomprehensibility of Artificial Intelligence

cs.CY · 2019-06-20 · unverdicted · novelty 3.0

Advanced AI systems are unexplainable in full and produce explanations that humans cannot comprehend.

On the Semantic Interpretability of Artificial Intelligence Models

cs.AI · 2019-07-09 · unverdicted · novelty 2.0

This survey classifies semantic interpretability methods in AI models by nature and feature introduction, reviews user impact, and identifies remaining gaps.

citing papers explorer

Showing 20 of 20 citing papers.

Rashomon Sets and Model Multiplicity in Federated Learning cs.LG · 2026-02-10 · unverdicted · none · ref 42 · internal anchor
A Unified Theory of Sparse Dictionary Learning in Mechanistic Interpretability: Piecewise Biconvexity and Spurious Minima cs.LG · 2025-12-05 · unverdicted · none · ref 2 · internal anchor
Agentic-imodels: Evolving agentic interpretability tools via autoresearch cs.AI · 2026-05-05 · unverdicted · none · ref 21
Interpretable Human-Label-Free Deep Learning for Real-Bogus Classification with Uncertainty Quantification astro-ph.IM · 2026-07-06 · conditional · none · ref 34 · internal anchor
Machine Learning Approaches for Improved Scalability of Metallic Magnetic Calorimeters physics.ins-det · 2026-06-23 · conditional · none · ref 62 · internal anchor
The Attribution Contract: Feature Attribution for Generative Language Models cs.LG · 2026-05-21 · unverdicted · none · ref 28 · internal anchor
Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces cs.LG · 2026-05-12 · unverdicted · none · ref 173 · internal anchor
On the definition and importance of interpretability in scientific machine learning cs.LG · 2025-05-16 · conditional · none · ref 44 · internal anchor
Enabling Global, Human-Centered Explanations for LLMs: From Tokens to Interpretable Code and Test Generation cs.SE · 2025-03-21 · unverdicted · none · ref 36 · internal anchor
The Price of Interpretability cs.LG · 2019-07-08 · unverdicted · none · ref 29 · internal anchor
What Physics do Data-Driven MoCap-to-Radar Models Learn? cs.LG · 2026-04-19 · unverdicted · none · ref 11
Reframing AI Loss of Control: What Control Is, How to Have It, How to Lose It cs.CY · 2026-05-19 · conditional · none · ref 9 · internal anchor
Visual Interaction with Deep Learning Models through Collaborative Semantic Inference cs.HC · 2019-07-24 · unverdicted · none · ref 58 · internal anchor
Optimal Explanations of Linear Models cs.LG · 2019-07-08 · unverdicted · none · ref 6 · internal anchor
A Human-Grounded Evaluation of SHAP for Alert Processing cs.LG · 2019-07-07 · unverdicted · none · ref 10 · internal anchor
Seeing What Shouldn't Be There: Counterfactual GANs for Medical Image Attribution cs.CV · 2026-05-06 · unverdicted · none · ref 31
Beyond Explainable AI (XAI): An Overdue Paradigm Shift and Post-XAI Research Directions cs.CY · 2026-02-27 · reject · none · ref 24 · internal anchor
Generating Counterfactual and Contrastive Explanations using SHAP cs.LG · 2019-06-21 · unverdicted · none · ref 6 · internal anchor
Unexplainability and Incomprehensibility of Artificial Intelligence cs.CY · 2019-06-20 · unverdicted · none · ref 37 · internal anchor
On the Semantic Interpretability of Artificial Intelligence Models cs.AI · 2019-07-09 · unverdicted · none · ref 9 · internal anchor
