Interpreting Language Models Through Concept Descriptions: A Survey

Laura Kopf; Nils Feldhus

arxiv: 2510.01048 · v1 · submitted 2025-10-01 · 💻 cs.CL · cs.AI · cs.LG


Pith reviewed 2026-05-18 10:31 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords concept descriptions · mechanistic interpretability · large language models · evaluation metrics · survey · model components · sparse autoencoders


The pith

Concept descriptions for language model components need more rigorous causal evaluation to explain model decisions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This survey reviews the use of generator models to create natural language descriptions that explain the roles of individual components like neurons and attention heads, as well as abstractions such as features from sparse autoencoders in large language models. It organizes the main generation methods, the automated and human-based metrics for judging those descriptions, and the datasets that support the work. The authors synthesize these elements to show that current practices fall short on causal validation, meaning tests that confirm whether a description truly drives the component's effect rather than merely correlating with it. A sympathetic reader would care because accurate descriptions could make the opaque decision processes inside powerful models more accessible and reliable. The paper positions this synthesis as a starting point for directing future efforts toward greater model transparency.
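The gap between correlational and causal evaluation can be made concrete with a toy sketch. The example below is not taken from the survey: the two "components", their weights, the inputs, and the description "fires on French text" are all hypothetical, chosen only to show how a description can predict a component's activations almost perfectly while ablating that component leaves the output unchanged.

```python
import math

# Hypothetical toy inputs: 1 if the text is French, 0 otherwise.
is_french = [1, 0, 1, 1, 0, 0, 1, 0]

def neuron(x):
    """Toy component under study: its activation tracks French inputs."""
    return 0.9 * x + 0.05

def other_unit(x):
    """A second toy component that also tracks French inputs."""
    return 1.0 * x

def model_output(x, ablate_neuron=False):
    """Toy output score. By construction, only other_unit is wired to the output."""
    n = 0.0 if ablate_neuron else neuron(x)
    return 2.0 * other_unit(x) + 0.0 * n

def pearson(xs, ys):
    """Plain Pearson correlation between two equal-length sequences."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / math.sqrt(vx * vy)

# Correlational check: does the description "fires on French text"
# predict the neuron's activations?
predicted = [float(x) for x in is_french]
actual = [neuron(x) for x in is_french]
correlational_score = pearson(predicted, actual)

# Causal check: does removing the neuron change the output at all?
causal_effect = sum(
    abs(model_output(x) - model_output(x, ablate_neuron=True)) for x in is_french
) / len(is_french)

print(f"correlational score: {correlational_score:.3f}")
print(f"mean output change under ablation: {causal_effect:.3f}")
```

In this toy setup the description scores close to 1.0 on the correlational check, yet the ablation shows no effect on the output, so the description says nothing about how the model reaches its decision. Real evaluations intervene on actual model components rather than on hand-written functions.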

Core claim

The central claim is that a rapidly expanding body of research produces open-vocabulary natural language concept descriptions for model components and abstractions, yet the field as a whole now requires a shift to more rigorous, causal forms of evaluation to determine whether those descriptions accurately capture underlying mechanisms.

What carries the argument

Concept descriptions: natural language labels generated by powerful models for the functional roles of neurons, attention heads, and learned sparse features inside large language models.

Load-bearing premise

The papers collected in the survey form a representative sample of the methods, metrics, and datasets that define the current state of concept description research.

What would settle it

Publication of a major generation technique or evaluation protocol that was omitted from the survey and that alters the identified gaps in causal testing would show the synthesis is incomplete.

Original abstract

Understanding the decision-making processes of neural networks is a central goal of mechanistic interpretability. In the context of Large Language Models (LLMs), this involves uncovering the underlying mechanisms and identifying the roles of individual model components such as neurons and attention heads, as well as model abstractions such as the learned sparse features extracted by Sparse Autoencoders (SAEs). A rapidly growing line of work tackles this challenge by using powerful generator models to produce open-vocabulary, natural language concept descriptions for these components. In this paper, we provide the first survey of the emerging field of concept descriptions for model components and abstractions. We chart the key methods for generating these descriptions, the evolving landscape of automated and human metrics for evaluating them, and the datasets that underpin this research. Our synthesis reveals a growing demand for more rigorous, causal evaluation. By outlining the state of the art and identifying key challenges, this survey provides a roadmap for future research toward making models more transparent.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

This is the first survey mapping concept description methods for LLM components and it organizes the space usefully, but the claim of growing demand for causal evaluation rests on undocumented literature choices.


This survey is the first to collect work on using generator models to produce natural language descriptions for neurons, attention heads, and SAE features in LLMs. It charts generation techniques, the automated and human metrics used to evaluate them, and the datasets behind them, then flags the need for stronger causal checks. That organization gives a practical entry point for people trying to track this slice of mechanistic interpretability work.

Referee Report

1 major / 1 minor

Summary. This survey reviews the emerging area of generating open-vocabulary natural language concept descriptions for LLM components (neurons, attention heads) and abstractions (SAE features) using generator models. It organizes the literature around generation methods, automated and human evaluation metrics, and supporting datasets, and synthesizes an observed trend toward more rigorous causal evaluation of these descriptions.

Significance. If the reviewed literature is representative, the survey would be a useful first map of a fast-moving subfield in mechanistic interpretability. Explicitly charting methods, metrics, and datasets while flagging the need for causal interventions could help standardize evaluation and reduce reliance on correlational proxies.

major comments (1)

[Abstract] The central claim that the 'synthesis reveals a growing demand for more rigorous, causal evaluation' is load-bearing for the paper's contribution. The manuscript provides no search protocol, inclusion/exclusion criteria, date bounds, or handling of preprints versus peer-reviewed work, so it is impossible to determine whether the reported trend reflects field-wide developments or curation choices that over- or under-sample correlational work.

minor comments (1)

[Abstract] The abstract states that the paper is 'the first survey' of the field; a brief qualification of scope (e.g., focus on post-2022 work or English-language preprints) would prevent over-claiming.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive feedback, which highlights an important aspect of methodological transparency in survey papers. We address the major comment below and will incorporate revisions to strengthen the manuscript.


Referee: [Abstract] The central claim that the 'synthesis reveals a growing demand for more rigorous, causal evaluation' is load-bearing for the paper's contribution. The manuscript provides no search protocol, inclusion/exclusion criteria, date bounds, or handling of preprints versus peer-reviewed work, so it is impossible to determine whether the reported trend reflects field-wide developments or curation choices that over- or under-sample correlational work.

Authors: We agree that explicitly documenting the literature selection process would allow readers to better evaluate whether the synthesized trend toward causal evaluation reflects field-wide developments. As this is the first survey of an emerging and fast-moving subfield, our coverage was based on identifying representative papers through targeted searches on arXiv, major conferences (NeurIPS, ICML, ICLR, ACL), and key terms related to concept descriptions, mechanistic interpretability, and evaluation of LLM components. However, we acknowledge the manuscript does not currently detail this process. In the revised version, we will add a dedicated subsection (likely in the introduction or a new Section 2) describing our search strategy, inclusion criteria (e.g., focus on open-vocabulary natural language descriptions of neurons, heads, or SAE features), exclusion criteria (e.g., works not providing human-interpretable descriptions), approximate date bounds, and rationale for including both preprints and peer-reviewed publications given the rapid evolution of the area. This will make the basis for our synthesis more transparent without altering the core contribution.

Revision: yes

Circularity Check

0 steps flagged

No significant circularity: survey aggregates external literature without internal reductions


This is a literature survey paper containing no mathematical derivations, equations, predictions, fitted parameters, or model outputs. The central synthesis—that the field shows growing demand for rigorous causal evaluation—is presented as an aggregation of reviewed external works rather than a derivation that reduces to the paper's own inputs by construction. No self-definitional loops, fitted-input predictions, or load-bearing self-citation chains exist; claims rest on cited references that are independent of the present manuscript. The paper is self-contained in its survey role, with methodology and conclusions verifiable against the body of cited literature.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

As a survey paper, there are no free parameters, axioms, or invented entities introduced by the authors; the contribution is limited to compilation and synthesis of prior work.

pith-pipeline@v0.9.0 · 5687 in / 982 out tokens · 29706 ms · 2026-05-18T10:31:14.433161+00:00 · methodology
