Interpretability Beyond Feature Attribution: Quantitative Testing with Concept Activation Vectors (TCAV)

Been Kim, Martin Wattenberg, Justin Gilmer, Carrie Cai, James Wexler, Fernanda Viegas, Rory Sayres · 2017 · stat.ML · arXiv 1711.11279

11 Pith papers cite this work. Polarity classification is still being indexed.

abstract

The interpretation of deep learning models is a challenge due to their size, complexity, and often opaque internal state. In addition, many systems, such as image classifiers, operate on low-level features rather than high-level concepts. To address these challenges, we introduce Concept Activation Vectors (CAVs), which provide an interpretation of a neural net's internal state in terms of human-friendly concepts. The key idea is to view the high-dimensional internal state of a neural net as an aid, not an obstacle. We show how to use CAVs as part of a technique, Testing with CAVs (TCAV), that uses directional derivatives to quantify the degree to which a user-defined concept is important to a classification result--for example, how sensitive a prediction of "zebra" is to the presence of stripes. Using the domain of image classification as a testing ground, we describe how CAVs may be used to explore hypotheses and generate insights for a standard image classification network as well as a medical application.
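The abstract's core mechanism can be illustrated with a small sketch. It is not the authors' implementation: the activations are synthetic, the network head is a made-up toy function, and the linear separator is a hand-rolled logistic regression chosen for self-containment. All names (fit_cav, logit_grad, tcav_score, v) are hypothetical. The sketch fits a linear separator between concept activations and random activations, takes its unit normal as the CAV, and reports the fraction of class inputs whose logit has a positive directional derivative along that vector. It omits any statistical testing across repeated runs.

```python
import numpy as np

rng = np.random.default_rng(0)
dim = 5  # size of a hypothetical layer's activation vector

# Synthetic activations (assumption): concept examples are shifted along axis 0.
concept_acts = rng.normal(size=(200, dim)) + np.array([3.0, 0.0, 0.0, 0.0, 0.0])
random_acts = rng.normal(size=(200, dim))


def fit_cav(pos, neg, steps=500, lr=0.1):
    """Fit a logistic-regression separator by gradient descent and return
    its unit-length weight vector, which plays the role of the CAV."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(steps):
        p = 1.0 / (1.0 + np.exp(-(X @ w + b)))  # predicted P(concept)
        w -= lr * (X.T @ (p - y)) / len(y)      # gradient of mean log-loss
        b -= lr * np.mean(p - y)
    return w / np.linalg.norm(w)


# Toy "rest of the network" (assumption): logit(a) = sum(v * tanh(a)).
v = np.array([1.0, 0.2, -0.3, 0.1, 0.0])


def logit_grad(acts):
    """Per-example gradient of the toy class logit w.r.t. the activations."""
    return v * (1.0 - np.tanh(acts) ** 2)


def tcav_score(grads, cav):
    """Fraction of examples whose directional derivative along the CAV is positive."""
    return float(np.mean(grads @ cav > 0))


cav = fit_cav(concept_acts, random_acts)
class_acts = rng.normal(size=(300, dim))  # activations of inputs from the class of interest
score = tcav_score(logit_grad(class_acts), cav)
print(f"CAV: {np.round(cav, 2)}  TCAV score: {score:.2f}")
```

In a real setting the activations and gradients would come from a chosen layer of a trained model rather than from synthetic data, and the score would be compared against scores obtained with random concept sets.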

citation-role summary

background 1

citation-polarity summary

background 1

representative citing papers

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

cs.LG · 2026-05-14 · unverdicted · novelty 7.0

Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

Unifying machine learning and quantum chemistry -- a deep neural network for molecular wavefunctions

physics.chem-ph · 2019-06-24 · unverdicted · novelty 7.0

Deep neural network predicts molecular wavefunctions in atomic orbital basis from which quantum properties are derived at force-field efficiency.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · novelty 7.0

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · novelty 6.0

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

UNBOX: Unveiling Black-box visual models with Natural-language

cs.CV · 2026-03-09 · unverdicted · novelty 6.0

UNBOX recovers interpretable text concepts that maximally activate classes in black-box vision models by recasting activation maximization as semantic search with LLMs and diffusion models.

FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry

cs.LG · 2026-05-11 · unverdicted · novelty 6.0

Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.

Finding Meaning in Embeddings: Concept Separation Curves

cs.CL · 2026-04-23 · unverdicted · novelty 6.0

Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.

Generative Counterfactual Introspection for Explainable Deep Learning

cs.LG · 2019-07-06 · unverdicted · novelty 5.0

A generative-model-driven introspection method produces counterfactual image edits to explain deep neural network predictions on MNIST and CelebA.

Explainable Artificial Intelligence Techniques for Interpretation of Food Models: a Review

cs.AI · 2025-04-12 · unverdicted · novelty 3.0

A survey proposing a taxonomy of XAI techniques for food quality research organized by data types and explanation methods.

Unexplainability and Incomprehensibility of Artificial Intelligence

cs.CY · 2019-06-20 · unverdicted · novelty 3.0

Advanced AI systems are unexplainable in full and produce explanations that humans cannot comprehend.

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

cs.CL · 2026-05-19

citing papers explorer

Showing 11 of 11 citing papers.

When Are Two Networks the Same? Tensor Similarity for Mechanistic Interpretability

cs.LG · 2026-05-14 · unverdicted · none · ref 15 · internal anchor

Tensor similarity is a symmetry-invariant metric that measures functional equivalence between tensor-based networks using a recursive algorithm for cross-layer mechanisms.

Unifying machine learning and quantum chemistry -- a deep neural network for molecular wavefunctions

physics.chem-ph · 2019-06-24 · unverdicted · none · ref 32 · internal anchor

Deep neural network predicts molecular wavefunctions in atomic orbital basis from which quantum properties are derived at force-field efficiency.

From Mechanistic to Compositional Interpretability

cs.LG · 2026-05-09 · unverdicted · none · ref 193

Compositional interpretability defines explanations as commuting syntactic-semantic mapping pairs grounded in compositionality and minimum description length, with compressive refinement and a parsimony theorem guaranteeing concise human-aligned decompositions.

Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

cs.LG · 2026-05-12 · unverdicted · none · ref 298 · internal anchor

A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

UNBOX: Unveiling Black-box visual models with Natural-language

cs.CV · 2026-03-09 · unverdicted · none · ref 26 · internal anchor

UNBOX recovers interpretable text concepts that maximally activate classes in black-box vision models by recasting activation maximization as semantic search with LLMs and diffusion models.

FeatMap: Understanding image manipulation in the feature space and its implications for feature space geometry

cs.LG · 2026-05-11 · unverdicted · none · ref 8

Linear mappings in feature space can reconstruct a wide range of image manipulations including semantic edits, suggesting that feature representations are approximately linearly organized.

Finding Meaning in Embeddings: Concept Separation Curves

cs.CL · 2026-04-23 · unverdicted · none · ref 26

Concept Separation Curves provide a classifier-independent method to visualize and quantify how sentence embeddings distinguish conceptual meaning from syntactic variations across languages and domains.

Generative Counterfactual Introspection for Explainable Deep Learning

cs.LG · 2019-07-06 · unverdicted · none · ref 19 · internal anchor

A generative-model-driven introspection method produces counterfactual image edits to explain deep neural network predictions on MNIST and CelebA.

Explainable Artificial Intelligence Techniques for Interpretation of Food Models: a Review

cs.AI · 2025-04-12 · unverdicted · none · ref 237 · internal anchor

A survey proposing a taxonomy of XAI techniques for food quality research organized by data types and explanation methods.

Unexplainability and Incomprehensibility of Artificial Intelligence

cs.CY · 2019-06-20 · unverdicted · none · ref 50 · internal anchor

Advanced AI systems are unexplainable in full and produce explanations that humans cannot comprehend.

CLIF: Concept-Level Influence Functions for Transparent Bottleneck Models

cs.CL · 2026-05-19 · unreviewed · ref 12 · internal anchor
