FADE: Why Bad Descriptions Happen to Good Features

Aakriti Jain; Bruno Puri; Elena Golimblevskaia; Patrick Kahardipraja; Sebastian Lapuschkin; Thomas Wiegand; Wojciech Samek

arXiv: 2502.16994 · v2 · submitted 2025-02-24 · cs.LG · cs.AI · cs.CL

Bruno Puri, Aakriti Jain, Elena Golimblevskaia, Patrick Kahardipraja, Thomas Wiegand, Wojciech Samek, Sebastian Lapuschkin

Pith reviewed 2026-05-23 01:51 UTC · model grok-4.3

keywords: feature descriptions · mechanistic interpretability · sparse autoencoders · MLP neurons · evaluation framework · automated interpretability · LLM features

The pith

A new framework called FADE shows that feature descriptions often fail to match the actual behavior of features extracted from language models, with larger gaps for sparse autoencoders than for MLP neurons.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces FADE as a model-agnostic method to score how well a description aligns with a feature's activations. It measures this alignment through four metrics and uses them to diagnose why descriptions diverge from the features they are meant to explain. When applied to existing open-source descriptions, the framework finds consistent shortfalls that are more pronounced in sparse autoencoders. A reader would care because reliable automated descriptions are a prerequisite for scaling mechanistic interpretability of large models. The work therefore quantifies the sources of misalignment to identify concrete limits in current description-generation pipelines.

Core claim

FADE is a scalable framework that evaluates feature-to-description alignment across Clarity, Responsiveness, Purity, and Faithfulness and, when applied to existing automated descriptions, reveals fundamental challenges in generating accurate feature descriptions, particularly for SAEs compared to MLP neurons.

What carries the argument

The FADE framework, which scores feature-description pairs on four metrics (Clarity, Responsiveness, Purity, Faithfulness) to quantify causes of misalignment.

If this is right

Current automated methods produce descriptions that systematically diverge from feature activation patterns.
Sparse autoencoders exhibit larger description misalignments than MLP neurons under the same evaluation.
The four-metric breakdown can be used to isolate which components of description-generation pipelines need improvement.
Insights from the metrics point toward specific limitations that future automated interpretability methods must address.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If the metrics remain stable across new models, they could serve as an ongoing benchmark for testing description generators before deployment.
The framework might be applied to compare description quality across different feature-extraction techniques beyond SAEs and MLPs.
Quantifying misalignment sources could guide the design of hybrid human-AI description workflows that target the weakest metric first.

Load-bearing premise

The four metrics Clarity, Responsiveness, Purity, and Faithfulness together provide a valid, non-redundant, and bias-free quantification of feature-description alignment that generalizes across models and feature types.

What would settle it

A study in which human raters independently judge the accuracy of many feature-description pairs and the resulting human scores show no correlation with the four FADE metrics would falsify the claim that the metrics measure alignment.
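
One way such a check could be run, offered as a minimal sketch and not as anything taken from the paper: rank-correlate human ratings with a single FADE metric using Spearman's rho. The function names and the toy numbers below are hypothetical illustrations; they are not FADE's API and not real results.

```python
from statistics import mean


def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j to cover the run of values tied with position i.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        shared = (i + j) / 2 + 1  # mean of the 1-based positions i+1 .. j+1
        for k in range(i, j + 1):
            ranks[order[k]] = shared
        i = j + 1
    return ranks


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (vx * vy) ** 0.5


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    return pearson(average_ranks(xs), average_ranks(ys))


# Toy, made-up inputs (NOT data from the paper): one human rating and one
# metric score per feature-description pair, in the same order.
human_ratings = [1, 2, 3, 4, 5]
metric_scores = [0.10, 0.40, 0.35, 0.80, 0.90]
rho = spearman(human_ratings, metric_scores)
```

A rho near zero for every metric, over a large enough sample of pairs, would be the outcome described above; a rank-based statistic is used here only because human ratings are typically ordinal.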

Original abstract

Recent advances in mechanistic interpretability have highlighted the potential of automating interpretability pipelines in analyzing the latent representations within LLMs. While this may enhance our understanding of internal mechanisms, the field lacks standardized evaluation methods for assessing the validity of discovered features. We attempt to bridge this gap by introducing FADE: Feature Alignment to Description Evaluation, a scalable model-agnostic framework for automatically evaluating feature-to-description alignment. FADE evaluates alignment across four key metrics - Clarity, Responsiveness, Purity, and Faithfulness - and systematically quantifies the causes of the misalignment between features and their descriptions. We apply FADE to analyze existing open-source feature descriptions and assess key components of automated interpretability pipelines, aiming to enhance the quality of descriptions. Our findings highlight fundamental challenges in generating feature descriptions, particularly for SAEs compared to MLP neurons, providing insights into the limitations and future directions of automated interpretability. We release FADE as an open-source package at: https://github.com/brunibrun/FADE

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces FADE, a scalable model-agnostic framework for evaluating alignment between LLM features and their descriptions via four metrics (Clarity, Responsiveness, Purity, Faithfulness). It applies the framework to existing open-source feature descriptions, identifies fundamental challenges in description generation (worse for SAEs than MLP neurons), and releases an open-source implementation.

Significance. If the metrics are shown to be robust and non-redundant, FADE would address a recognized gap by providing a standardized, automated evaluation tool for automated interpretability pipelines, with potential to guide improvements in description quality.

major comments (2)

[§3] The four metrics are defined, but the manuscript provides no ablation (correlation matrix, PCA, or variance decomposition) demonstrating they are non-redundant, no sensitivity analysis to prompt phrasing or description length, and no human correlation study establishing that the scores track ground-truth alignment. Because the SAE-vs-MLP gap and the central claim of 'fundamental challenges' rest entirely on these scores, the absence of such validation makes the quantitative findings difficult to interpret.
[Results section, application of FADE] The reported differences between SAEs and MLP neurons are presented as evidence of inherent limitations, yet without controls for metric artifacts (e.g., whether any metric correlates with description length or model-specific output statistics), it is unclear whether the gap is an interpretability phenomenon or a consequence of how the metrics are constructed.
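
The redundancy check requested in the first comment could look roughly like the following sketch: a pairwise Pearson correlation matrix over the four metric scores, plus a helper that flags highly correlated pairs. The score values, the `0.9` threshold, and the function names are all assumptions made for illustration; none come from the paper or the FADE package.

```python
from itertools import combinations
from statistics import mean


def pearson(xs, ys):
    """Pearson correlation of two equal-length sequences."""
    mx, my = mean(xs), mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    if vx == 0 or vy == 0:
        raise ValueError("correlation is undefined for a constant series")
    return cov / (vx * vy) ** 0.5


def correlation_matrix(scores):
    """scores maps metric name -> per-feature scores (same feature order)."""
    names = list(scores)
    return {a: {b: pearson(scores[a], scores[b]) for b in names} for a in names}


def redundant_pairs(matrix, threshold=0.9):
    """Metric pairs whose absolute correlation reaches the threshold."""
    return [
        (a, b)
        for a, b in combinations(matrix, 2)
        if abs(matrix[a][b]) >= threshold
    ]


# Toy, made-up scores for four features (NOT results from the paper).
# "responsiveness" is deliberately a rescaled copy of "clarity" so that the
# pair is flagged as redundant in this illustration.
toy_scores = {
    "clarity": [0.1, 0.2, 0.3, 0.4],
    "responsiveness": [0.2, 0.4, 0.6, 0.8],
    "purity": [0.4, 0.1, 0.3, 0.2],
    "faithfulness": [0.3, 0.1, 0.4, 0.2],
}
matrix = correlation_matrix(toy_scores)
flagged = redundant_pairs(matrix)
```

A linear correlation matrix only detects linear redundancy; the PCA or variance decomposition the referee also mentions would be a complementary check and is not sketched here.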

minor comments (2)

[Abstract] The abstract states the framework is 'model-agnostic' but does not specify the range of models or feature extractors tested; adding this detail would strengthen the generality claim.
The GitHub link is provided, but the manuscript does not include a brief reproducibility checklist (exact prompts, model versions, and data splits used for the reported experiments).

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for their careful reading and constructive feedback. The comments highlight important aspects of metric validation and result interpretation that we address below with planned revisions to the manuscript.

Point-by-point responses

Referee: [§3] The four metrics are defined, but the manuscript provides no ablation (correlation matrix, PCA, or variance decomposition) demonstrating they are non-redundant, no sensitivity analysis to prompt phrasing or description length, and no human correlation study establishing that the scores track ground-truth alignment. Because the SAE-vs-MLP gap and the central claim of 'fundamental challenges' rest entirely on these scores, the absence of such validation makes the quantitative findings difficult to interpret.

Authors: We agree that explicit validation of the metrics would strengthen the claims. In the revised manuscript we will add a correlation matrix and PCA decomposition across the four metrics on the evaluated descriptions to demonstrate non-redundancy. We will also include sensitivity analyses with respect to description length. A comprehensive human correlation study lies outside the current scope given the scale of the automated evaluation; we will explicitly discuss this as a limitation and note it as valuable future work. These additions will better support the quantitative results while preserving the paper's focus on the scalable framework.
Revision: partial

Referee: [Results section, application of FADE] The reported differences between SAEs and MLP neurons are presented as evidence of inherent limitations, yet without controls for metric artifacts (e.g., whether any metric correlates with description length or model-specific output statistics), it is unclear whether the gap is an interpretability phenomenon or a consequence of how the metrics are constructed.

Authors: We will revise the results section to include explicit controls for potential artifacts, specifically reporting correlations between each metric and description length as well as other model-specific output statistics. This will allow readers to assess whether the SAE-vs-MLP gap reflects differences in alignment or arises from metric construction, thereby clarifying the interpretation of the findings.
Revision: yes
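
As a rough illustration of what a description-length control could involve (a sketch under stated assumptions, not the authors' planned analysis): fit a straight line of metric score against description length, then compare the SAE and MLP groups on the residuals instead of the raw scores. The data, group labels, and function names below are hypothetical.

```python
from statistics import mean


def fit_line(xs, ys):
    """Least-squares slope and intercept for ys ~ slope * xs + intercept."""
    mx, my = mean(xs), mean(ys)
    sxx = sum((x - mx) ** 2 for x in xs)
    if sxx == 0:
        raise ValueError("cannot fit a line when all x values are equal")
    slope = sum((x - mx) * (y - my) for x, y in zip(xs, ys)) / sxx
    return slope, my - slope * mx


def residualize(xs, ys):
    """Part of ys left over after removing the fitted linear effect of xs."""
    slope, intercept = fit_line(xs, ys)
    return [y - (slope * x + intercept) for x, y in zip(xs, ys)]


def group_gap(values, groups, first, second):
    """Mean of `values` in group `first` minus the mean in group `second`."""
    mean_first = mean(v for v, g in zip(values, groups) if g == first)
    mean_second = mean(v for v, g in zip(values, groups) if g == second)
    return mean_first - mean_second


# Toy, made-up data (NOT results from the paper). Scores are constructed to
# depend only on description length, and the SAE descriptions are longer.
lengths = [10, 20, 30, 40, 50, 60]
groups = ["mlp", "mlp", "mlp", "sae", "sae", "sae"]
scores = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4]

raw_gap = group_gap(scores, groups, "mlp", "sae")
adjusted_gap = group_gap(residualize(lengths, scores), groups, "mlp", "sae")
```

In this constructed case the raw gap vanishes once length is removed, which is the artifact scenario the referee raises. The caveat runs the other way too: when length and feature type are strongly confounded, pooled residualization can also absorb a genuine difference, so a regression that includes a group indicator alongside length would be a more careful version of the same idea.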

Circularity Check

0 steps flagged

No significant circularity in FADE metric definitions or claims

Full rationale

The paper introduces FADE as a new evaluation framework consisting of four metrics (Clarity, Responsiveness, Purity, Faithfulness) applied to existing open-source feature descriptions. It then uses the resulting scores to compare SAEs against MLP neurons and identify challenges in automated interpretability. No equations, self-citations, or definitions are provided in the manuscript that reduce any metric or central claim to a tautology, fitted parameter, or prior self-citation chain. The metrics are presented as independent, scalable measures of alignment without evidence that any is constructed from the same descriptions being scored. The derivation is therefore self-contained; the reported SAE-vs-MLP gap follows from applying the stated metrics rather than from any definitional loop.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract-only review; the ledger is populated from the stated claims and cannot be audited against the full methods or derivations.

axioms (1)

Domain assumption: The four metrics Clarity, Responsiveness, Purity, and Faithfulness capture the essential dimensions of feature-description alignment.
Invoked by the definition of the FADE framework itself.

pith-pipeline@v0.9.0 · 5729 in / 1239 out tokens · 32188 ms · 2026-05-23T01:51:19.668203+00:00 · methodology
