Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features

Abeynaya Gnanasekaran; Eric Darve; John Winnicki

arxiv: 2604.23829 · v2 · submitted 2026-04-26 · 💻 cs.AI

Pith reviewed 2026-05-08 06:05 UTC · model grok-4.3

classification 💻 cs.AI

keywords sparse autoencoders · knowledge graphs · feature interpretability · domain filtering · co-occurrence graphs · transcoder mechanisms · model auditing · language models

The pith

Sparse autoencoder features can be filtered and organized into domain-specific knowledge graphs that map a language model's internal concepts.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows how to convert a large collection of sparse autoencoder features into structured knowledge graphs by using contrastive activations to filter for domain-specific concepts and then constructing co-occurrence and mechanism graphs with automated labels. A sympathetic reader would care because this turns scattered feature interpretations into a connected view of how the model organizes knowledge, making it possible to see relationships and audit reasoning processes. In the biology textbook case, the method recovers coherent structures from chapters and subchapters while simplifying complex activation patterns into readable forms.

Core claim

The central claim is that a strict domain-specific concept universe can be constructed from SAE features using contrastive activations and multi-stage filtering, after which two aligned graph views—a co-occurrence graph for conceptual structure and a transcoder-based mechanism graph linking features across layers—can be built and labeled to create readable knowledge graphs that map the model's internal knowledge.
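
As a rough illustration of what contrastive filtering could look like, the sketch below keeps a feature only if it fires appreciably on domain text and much more strongly there than on generic text. The function name, the two thresholds (`min_ratio`, `min_domain_mean`), and the toy activations are all assumptions made for this example; the paper's actual stages and criteria are not specified in the material above.

```python
from statistics import mean

def contrastive_filter(domain_acts, generic_acts,
                       min_ratio=3.0, min_domain_mean=0.1, eps=1e-6):
    """Return feature ids that look domain-specific (hypothetical criteria).

    domain_acts / generic_acts: dict feature_id -> list of activations,
    one value per sentence of the domain / generic corpus.
    """
    kept = []
    for fid, d_vals in domain_acts.items():
        d_mean = mean(d_vals)
        g_mean = mean(generic_acts.get(fid, [0.0]))
        # Stage 1 (assumed): drop features that barely fire on domain text.
        if d_mean < min_domain_mean:
            continue
        # Stage 2 (assumed): require a clear domain-vs-generic contrast.
        if d_mean / (g_mean + eps) >= min_ratio:
            kept.append(fid)
    return sorted(kept)

# Toy activations, invented for illustration only.
domain = {"mitosis": [0.9, 0.8, 0.0], "the": [1.0, 1.0, 1.0], "rare": [0.01, 0.0, 0.0]}
generic = {"mitosis": [0.0, 0.05, 0.0], "the": [1.0, 0.9, 1.0], "rare": [0.0, 0.0, 0.0]}
selected = contrastive_filter(domain, generic)
print(selected)
```

Here the generic feature `the` fails the contrast test and `rare` fails the minimum-activity test, so only `mitosis` survives. A ratio test is just one possible contrast; a difference of means or a rank-based statistic would fit the same slot.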

What carries the argument

The domain-filtered knowledge graphs, consisting of a multi-granularity co-occurrence graph and a transcoder mechanism graph with automated edge labels, built on contrastively filtered SAE features.
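
To make the co-occurrence view concrete, here is a minimal sketch that scores feature pairs by pointwise mutual information (PMI) over the units in which they are active. PMI, the `min_count` cutoff, and the toy feature names are assumptions for illustration; the source does not say which co-occurrence statistic the authors use.

```python
import math
from collections import Counter
from itertools import combinations

def cooccurrence_edges(active_sets, min_count=2):
    """Build weighted edges between features that are active together.

    active_sets: list of sets, each holding the feature ids active in one
    unit (a sentence here; a paragraph or section would give a coarser view).
    Returns dict (feature_a, feature_b) -> positive PMI score.
    """
    n = len(active_sets)
    single = Counter()
    pair = Counter()
    for feats in active_sets:
        single.update(feats)
        pair.update(combinations(sorted(feats), 2))
    edges = {}
    for (a, b), c in pair.items():
        if c < min_count:  # ignore pairs seen too rarely to trust
            continue
        pmi = math.log((c * n) / (single[a] * single[b]))
        if pmi > 0:  # keep only pairs that co-occur more than chance
            edges[(a, b)] = pmi
    return edges

# Toy data, invented for illustration only.
sentences = [
    {"cell", "membrane"},
    {"cell", "membrane", "lipid"},
    {"gene", "dna"},
    {"gene", "dna", "cell"},
]
edges = cooccurrence_edges(sentences)
print(edges)
```

Running the same function with sentences, paragraphs, or sections as the unit is one simple way to obtain graphs at several granularities, though the paper may construct its multi-granularity view differently.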

If this is right

The graphs recover coherent chapter and subchapter-level structure in a biology textbook domain.
They reveal concepts that bridge neighboring topics.
Messy sentence-level activity involving thousands of features is condensed into compact, readable views of local model activity.
This approach enables audits of reasoning faithfulness by providing a global map of model knowledge.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The mechanism graph could be used to trace specific feature pathways and explain how the model reaches particular outputs.
Targeted changes to nodes or edges in these graphs might allow precise editing of the model's domain knowledge.
The same filtering and graphing steps could apply to other domains to produce comparable internal maps for different models.

Load-bearing premise

The multi-stage contrastive filtering and the graph constructions reflect the model's genuine domain concepts and relationships, rather than artifacts of the SAE or of the filtering steps themselves.

What would settle it

If manual review or comparison to the source textbook shows that the graphs do not recover the chapter structure or contain many incorrect connections between unrelated concepts, the claim that they form accurate knowledge graphs would be falsified.

Figures

Figures reproduced from arXiv: 2604.23829 by Abeynaya Gnanasekaran, Eric Darve, John Winnicki.

**Figure 1.** Shared-coordinate activation maps recover textbook structure at multiple scales.

**Figure 2.** Capability progression from diffuse sentence activity to a readable mechanism…

**Figure 3.** Edge-labeled knowledge-graph views. Panel A shows a co-occurrence bridge…

**Figure 4.** Full chapter-density atlas on shared sentence-graph coordinates. Each panel shows…

**Figure 5.** Constructing the abstraction tree. (a) A mutual-kNN filter cleans the local neighborhood graph built from feature-description embeddings, which gives a better geometry for recursive splitting. (b) The resulting local groups are organized into progressively broader internal concepts, with summaries grounded by representatives and descendant leaf anchors.

**Figure 6.** Transcoder mechanisms viewed through readable SAE dictionaries.

**Figure 7.** Hierarchy-respecting compression of a dense mechanism graph.
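
The mutual-kNN filter mentioned in the Figure 5 caption can be sketched as follows: an edge between two feature descriptions survives only if each one is among the other's k nearest neighbors. The cosine similarity, the value of `k`, and the two-dimensional toy embeddings are assumptions for illustration, not details taken from the paper.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def mutual_knn_edges(embeddings, k=1):
    """Keep edge (a, b) only if a is in b's top-k AND b is in a's top-k.

    embeddings: dict name -> vector (hypothetical description embeddings).
    """
    names = list(embeddings)
    topk = {}
    for a in names:
        ranked = sorted(
            (b for b in names if b != a),
            key=lambda b: cosine(embeddings[a], embeddings[b]),
            reverse=True,
        )
        topk[a] = set(ranked[:k])
    return {
        tuple(sorted((a, b)))
        for a in names
        for b in topk[a]
        if a in topk[b]
    }

# Toy embeddings, invented for illustration only.
emb = {
    "mitosis": [1.0, 0.0],
    "meiosis": [0.9, 0.1],
    "osmosis": [0.0, 1.0],
    "diffusion": [0.1, 0.9],
    "hub": [0.7, 0.7],
}
knn_edges = mutual_knn_edges(emb, k=1)
print(knn_edges)
```

In this toy case `hub` points at one of the tight pairs, but nobody lists `hub` as a nearest neighbor in return, so it gets no edge. That one-sided pruning is the sense in which a mutual filter "cleans" a plain kNN graph.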

Original abstract

Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren't very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there's no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than unlabeled layouts. In a case study on a biology textbook, these graphs recover coherent chapter and subchapter-level structure, reveal concepts that bridge neighboring topics, and transform messy sentence-level activity containing thousands of features into compact, readable views that illustrate the model's local activity. Taken together, this reframes a flat SAE inventory as an internal knowledge graph that converts feature-level interpretability into a global map of model knowledge and enables audits of reasoning faithfulness.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

This paper builds a filtering-plus-graph pipeline on top of SAE features to create domain-specific knowledge graphs, but the only evidence is a single qualitative biology case study with no ablations or metrics.

The core contribution is a concrete pipeline that takes a large SAE feature inventory, applies contrastive multi-stage filtering to isolate a domain concept set, then constructs two aligned graphs: one from co-occurrence statistics at different granularities and one from transcoder pathways linking layers. Automated labeling turns the graphs into readable knowledge maps. In the biology textbook example the graphs recover chapter and subchapter structure plus bridging concepts, and they compress messy per-sentence feature activity into compact views. That is a practical step beyond flat feature lists, and the dual-graph design (corpus structure plus mechanism links) is a reasonable way to address both organization and cross-layer relationships at once. The filtering step itself looks like a straightforward extension of existing contrastive techniques rather than a wholly new invention, but the integration into readable, multi-view graphs is new enough to be worth examining.

The main limitation is the evaluation. Everything rests on one qualitative case study with no quantitative metrics, no ablation of the contrastive or filtering stages, and no comparison against held-out activations or external ontologies. Without those checks it is difficult to tell whether the recovered structure reflects the model's actual internal knowledge or simply the organization of the textbook corpus plus labeling choices. The claims about enabling audits of reasoning faithfulness therefore sit on untested ground.

This work is aimed at interpretability researchers already running SAEs who need better ways to navigate and audit feature sets at scale. It is coherent on its own terms and shows clear thinking about the limitations of flat inventories, so it deserves a serious referee who can ask for the missing quantitative validation and ablations. I would send it to review rather than desk reject.

Referee Report

3 major / 2 minor

Summary. The paper proposes a pipeline to convert flat inventories of sparse autoencoder (SAE) features into domain-filtered knowledge graphs. It first applies contrastive activations and a multi-stage filtering process to isolate a domain-specific concept universe, then constructs two aligned graph representations—a co-occurrence graph capturing corpus-level conceptual structure at multiple granularities and a transcoder-based mechanism graph linking source- and target-layer features via sparse pathways—followed by automated edge labeling to produce readable graphs. A qualitative case study on a biology textbook is used to illustrate recovery of chapter and subchapter structure, identification of bridging concepts, and compact views of local model activity.
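
One plausible reading of "linking source- and target-layer features via sparse pathways" is sketched below: each transcoder latent is scored by how strongly its read vector aligns with a source-layer SAE feature direction and how strongly its write vector aligns with a target-layer one, and a path is kept when both alignments are strong. All names (`read_vecs`, `write_vecs`, `src_dirs`, `tgt_dirs`), the dot-product scoring, and the threshold are assumptions for this example, not the paper's stated construction.

```python
def dot(u, v):
    """Plain dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mechanism_edges(read_vecs, write_vecs, src_dirs, tgt_dirs, threshold=0.5):
    """Return (source_feature, latent, target_feature, weight) paths.

    read_vecs[k] / write_vecs[k]: hypothetical encoder / decoder vectors
    of transcoder latent k.
    src_dirs / tgt_dirs: hypothetical SAE decoder directions for the
    source-layer and target-layer features.
    """
    paths = []
    for k in read_vecs:
        reads = {s: dot(read_vecs[k], d) for s, d in src_dirs.items()}
        writes = {t: dot(write_vecs[k], d) for t, d in tgt_dirs.items()}
        for s, r in reads.items():
            for t, w in writes.items():
                # Keep a path only when both alignments clear the threshold.
                if r >= threshold and w >= threshold:
                    paths.append((s, k, t, r * w))
    return paths

# Toy vectors, invented for illustration only.
read_vecs = {"latent_0": [1.0, 0.0]}
write_vecs = {"latent_0": [0.0, 1.0]}
src_dirs = {"enzyme": [0.9, 0.1], "weather": [0.0, 1.0]}
tgt_dirs = {"catalysis": [0.1, 0.8], "poetry": [1.0, 0.0]}
paths = mechanism_edges(read_vecs, write_vecs, src_dirs, tgt_dirs)
print(paths)
```

This is a purely static, weight-based view. Whether a candidate path is actually used on a given input would need activation data, which is the kind of evidence the referee's third major comment asks for.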

Significance. If the filtering and graph constructions can be shown to faithfully extract model-internal concepts and relations without substantial distortion, the work would meaningfully advance SAE-based interpretability by moving from isolated features to structured, auditable maps of model knowledge. The qualitative biology case study provides suggestive evidence that the graphs can surface coherent domain organization and support reasoning audits, but the absence of quantitative validation or ablations limits the strength of this contribution.

major comments (3)

[§3 (Domain Filtering Pipeline)] The central claim that the multi-stage filtering plus co-occurrence and transcoder graphs produce a faithful internal knowledge graph rests on the untested assumption that these steps accurately capture the model's domain concepts without artifacts or information loss. No ablations are reported that isolate the effect of the contrastive activation step or compare filtered versus unfiltered feature sets against held-out model activations.
[§5 (Biology Textbook Case Study)] In the biology textbook case study, the graphs are reported to recover chapter-level structure and bridging concepts. However, no quantitative metrics (e.g., precision/recall against the textbook's table of contents, edge overlap with external ontologies, or correlation with held-out activation patterns) or control experiments (e.g., graphs from random SAE features or corpus-only baselines) are provided to rule out that the observed structure arises from labeling heuristics or corpus organization rather than model internals.
[§4 (Graph Construction)] The transcoder-based mechanism graph is presented as linking features through sparse latent pathways, yet no evaluation is given of how well these edges correspond to actual cross-layer computations (e.g., via activation patching or correlation on held-out data) or of the alignment quality between the co-occurrence and mechanism views.
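
One way the requested precision and recall against the table of contents could be computed is by pair counting: two sentences count as a true positive when the graph places them in the same community and the textbook places them in the same chapter. The function and the toy labels below are assumptions for illustration; the referee does not prescribe a specific formulation.

```python
from itertools import combinations

def pair_precision_recall(predicted, reference):
    """Pair-counting precision and recall of a grouping against a reference.

    predicted / reference: dict item -> group label, over the same items.
    """
    tp = fp = fn = 0
    for a, b in combinations(sorted(predicted), 2):
        same_pred = predicted[a] == predicted[b]
        same_ref = reference[a] == reference[b]
        if same_pred and same_ref:
            tp += 1  # grouped together, and truly in the same chapter
        elif same_pred:
            fp += 1  # grouped together, but from different chapters
        elif same_ref:
            fn += 1  # same chapter, but split by the graph
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

# Toy labels, invented for illustration only.
communities = {"s1": "A", "s2": "A", "s3": "A", "s4": "B"}
chapters = {"s1": "ch1", "s2": "ch1", "s3": "ch2", "s4": "ch2"}
precision, recall = pair_precision_recall(communities, chapters)
print(precision, recall)
```

Running the same computation on communities built from random SAE features, or from corpus text alone, would give the control baselines the referee asks for.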

minor comments (2)

The abstract and introduction would benefit from an explicit statement of the evaluation criteria (qualitative only, or planned quantitative follow-ups) to set reader expectations.
Figure captions for the graph visualizations should include legends clarifying node colors, edge types, and granularity levels to improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. The comments correctly note the primarily qualitative nature of the current evidence and the value of additional ablations and metrics. We have revised the manuscript to incorporate ablations on the filtering pipeline, quantitative metrics and controls for the case study, and further evaluation of the mechanism graph edges. Our point-by-point responses follow.

Referee: [§3 (Domain Filtering Pipeline)] The central claim that the multi-stage filtering plus co-occurrence and transcoder graphs produce a faithful internal knowledge graph rests on the untested assumption that these steps accurately capture the model's domain concepts without artifacts or information loss. No ablations are reported that isolate the effect of the contrastive activation step or compare filtered versus unfiltered feature sets against held-out model activations.

Authors: We agree that explicit ablations strengthen the validation of the filtering pipeline. In the revised manuscript we have added an ablation that isolates the contrastive activation step by comparing it against a frequency-threshold baseline. We also compare the filtered feature set to the full unfiltered SAE inventory by measuring activation coverage on held-out biology textbook passages. These additions demonstrate that contrastive filtering retains the large majority of domain-relevant activations while substantially reducing generic features, supporting that the pipeline introduces limited artifacts. The new results and discussion appear in an expanded §3. revision: yes
Referee: [§5 (Biology Textbook Case Study)] In the biology textbook case study, the graphs are reported to recover chapter-level structure and bridging concepts. However, no quantitative metrics (e.g., precision/recall against the textbook's table of contents, edge overlap with external ontologies, or correlation with held-out activation patterns) or control experiments (e.g., graphs from random SAE features or corpus-only baselines) are provided to rule out that the observed structure arises from labeling heuristics or corpus organization rather than model internals.

Authors: The case study was designed as an illustrative demonstration of structure recovery and audit utility. We accept that quantitative support is needed. The revised version adds normalized mutual information between graph-derived communities and the textbook chapter hierarchy, plus a control experiment using random SAE feature subsets that yields substantially weaker structure recovery. Correlation with held-out activations is already embedded in the co-occurrence construction; we have clarified this and added a brief discussion of the practical limits of external ontology overlap. These changes are incorporated in §5. revision: partial
Referee: [§4 (Graph Construction)] The transcoder-based mechanism graph is presented as linking features through sparse latent pathways, yet no evaluation is given of how well these edges correspond to actual cross-layer computations (e.g., via activation patching or correlation on held-out data) or of the alignment quality between the co-occurrence and mechanism views.

Authors: The mechanism graph edges are produced by the transcoder's learned sparse pathways, which are optimized to reconstruct cross-layer activations. In revision we have added a correlation analysis between proposed edges and feature co-activations on held-out data, showing statistically significant alignment. Alignment quality between the co-occurrence and mechanism views is now quantified via edge overlap and node-centrality correlation within the biology case study. Exhaustive activation patching for every edge remains computationally prohibitive at the current scale and is noted as future work; the added correlation results are presented in §4. revision: partial
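
For readers unfamiliar with the normalized mutual information (NMI) score cited in the second response, the sketch below computes it between two labelings of the same items. The arithmetic-mean normalization and the toy labels are assumptions; the rebuttal does not say which NMI variant was used.

```python
import math
from collections import Counter

def normalized_mutual_information(labels_a, labels_b):
    """NMI between two labelings, normalized by the mean of the entropies."""
    n = len(labels_a)
    count_a = Counter(labels_a)
    count_b = Counter(labels_b)
    joint = Counter(zip(labels_a, labels_b))
    # Mutual information from joint and marginal label counts.
    mi = sum(
        (c / n) * math.log((c * n) / (count_a[a] * count_b[b]))
        for (a, b), c in joint.items()
    )
    h_a = -sum((c / n) * math.log(c / n) for c in count_a.values())
    h_b = -sum((c / n) * math.log(c / n) for c in count_b.values())
    denom = (h_a + h_b) / 2
    # Two single-group labelings agree trivially; treat that case as 1.0.
    return mi / denom if denom > 0 else 1.0

# Toy labels, invented for illustration only.
communities = ["A", "A", "B", "B"]
chapters_aligned = ["ch1", "ch1", "ch2", "ch2"]
chapters_unrelated = ["ch1", "ch2", "ch1", "ch2"]
nmi_aligned = normalized_mutual_information(communities, chapters_aligned)
nmi_unrelated = normalized_mutual_information(communities, chapters_unrelated)
print(nmi_aligned, nmi_unrelated)
```

A score near 1 means the communities track the chapters closely, and a score near 0 means they carry no information about them. A high score alone does not separate model-internal structure from corpus organization, which is why the random-feature control matters.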

Circularity Check

0 steps flagged

No circularity: constructive pipeline from activations and features

The paper describes a methodological pipeline that starts from an existing SAE feature inventory, applies contrastive activations for domain filtering, constructs co-occurrence and transcoder graphs, and performs automated labeling. These are forward constructive steps on external model outputs rather than any derivation that reduces to its own inputs by definition, fitted-parameter renaming, or self-citation load-bearing. No equations or claims in the abstract or described process exhibit self-definitional loops, uniqueness theorems imported from the authors, or ansatzes smuggled via prior work. The biology case study is presented as qualitative recovery, not a prediction forced by the construction itself. The derivation chain remains self-contained against the SAE activations and corpus data.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The approach rests on standard interpretability assumptions about feature meaning and graph utility rather than new postulates. The abstract gives too little detail to list specific free parameters.

axioms (2)

domain assumption: SAE features represent interpretable concepts that can be isolated into domain-specific sets via contrastive activations and multi-stage filtering.
This underpins the initial construction of the strict domain-specific concept universe.
domain assumption: Co-occurrence patterns and transcoder pathways between layers reflect meaningful conceptual and mechanistic relationships suitable for graph representation.
This justifies building and labeling the two aligned graph views.

pith-pipeline@v0.9.0 · 5515 in / 1303 out tokens · 78963 ms · 2026-05-08T06:05:38.722535+00:00 · methodology
