Explaining Explanations: An Overview of Interpretability of Machine Learning

2018 · cs.AI · arXiv 1806.00069

5 Pith papers cite this work. Polarity classification is still indexing.


abstract

There has recently been a surge of work in explanatory artificial intelligence (XAI). This research area tackles the important problem that complex machines and algorithms often cannot provide insights into their behavior and thought processes. XAI allows users and parts of the internal system to be more transparent, providing explanations of their decisions in some level of detail. These explanations are important to ensure algorithmic fairness, identify potential bias or problems in the training data, and ensure that the algorithms perform as expected. However, explanations produced by these systems are neither standardized nor systematically assessed. In an effort to create best practices and identify open challenges, we provide our definition of explainability and show how it can be used to classify existing literature. We discuss why current approaches to explanatory methods, especially for deep neural networks, are insufficient. Finally, based on our survey, we conclude with suggested future research directions for explanatory artificial intelligence.

representative citing papers

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

On the definition and importance of interpretability in scientific machine learning

cs.LG · 2025-05-16 · conditional · novelty 6.0

Interpretability in SciML requires mechanistic understanding rather than sparsity, and prior knowledge is often essential for interpretable scientific discovery.

The Price of Interpretability

cs.LG · 2019-07-08 · unverdicted · novelty 6.0

Introduces a framework for constructing ML models via interpretable steps, generalizes standard proxies into a parametrized family of measures, and quantifies the accuracy-interpretability tradeoff via practical algorithms.

Optimal Explanations of Linear Models

cs.LG · 2019-07-08 · unverdicted · novelty 5.0

An optimization framework decomposes linear models into increasing-complexity sequences using coordinate updates to generate parametrized interpretability metrics.

Do Transformer Attention Heads Provide Transparency in Abstractive Summarization?

cs.CL · 2019-07-01 · unverdicted · novelty 5.0

Analysis of transformer attention heads in abstractive summarization shows specialization in some heads and proposes a method to measure model reliance on learned attention distributions.

citing papers explorer

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces · cs.LG · 2026-05-12 · unverdicted · none · ref 161 · internal anchor
On the definition and importance of interpretability in scientific machine learning · cs.LG · 2025-05-16 · conditional · none · ref 27 · internal anchor
The Price of Interpretability · cs.LG · 2019-07-08 · unverdicted · none · ref 18 · internal anchor
Optimal Explanations of Linear Models · cs.LG · 2019-07-08 · unverdicted · none · ref 27 · internal anchor
Do Transformer Attention Heads Provide Transparency in Abstractive Summarization? · cs.CL · 2019-07-01 · unverdicted · none · ref 7 · internal anchor
