Domain-Filtered Knowledge Graphs from Sparse Autoencoder Features
Pith reviewed 2026-05-08 06:05 UTC · model grok-4.3
The pith
Sparse autoencoder features can be filtered and organized into domain-specific knowledge graphs that map a language model's internal concepts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that a strict domain-specific concept universe can be constructed from SAE features using contrastive activations and multi-stage filtering. Two aligned graph views can then be built on that filtered set and labeled to yield readable knowledge graphs of the model's internal knowledge: a co-occurrence graph for conceptual structure, and a transcoder-based mechanism graph linking features across layers.
What carries the argument
The domain-filtered knowledge graphs, consisting of a multi-granularity co-occurrence graph and a transcoder mechanism graph with automated edge labels, built on contrastively filtered SAE features.
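As a rough illustration of what contrastive filtering of SAE features could look like, the sketch below keeps a feature only if it fires often on domain text and much more often there than on background text. The function names, the firing-rate criterion, and both thresholds are assumptions made for this example; the page does not describe the paper's actual filtering stages.

```python
from typing import List, Sequence


def firing_rate(acts: Sequence[Sequence[float]], feature: int) -> float:
    """Fraction of samples on which `feature` has a positive activation."""
    if not acts:
        return 0.0
    return sum(1 for row in acts if row[feature] > 0.0) / len(acts)


def contrastive_filter(
    domain_acts: Sequence[Sequence[float]],
    background_acts: Sequence[Sequence[float]],
    min_domain_rate: float = 0.2,  # hypothetical threshold
    min_ratio: float = 3.0,        # hypothetical threshold
) -> List[int]:
    """Return indices of features that fire far more on domain text."""
    smoothing = 1e-6  # avoids division by zero for features silent on background
    kept = []
    for f in range(len(domain_acts[0])):
        d = firing_rate(domain_acts, f)
        b = firing_rate(background_acts, f)
        if d >= min_domain_rate and (d + smoothing) / (b + smoothing) >= min_ratio:
            kept.append(f)
    return kept


# Toy activations: rows are text samples, columns are SAE features.
domain = [[1.0, 0.5, 0.0], [0.8, 0.4, 0.0], [0.9, 0.0, 0.1], [0.0, 0.6, 0.0]]
background = [[0.0, 0.7, 0.0], [0.0, 0.5, 0.2], [0.0, 0.6, 0.0], [0.0, 0.4, 0.0]]
kept = contrastive_filter(domain, background)
```

In this toy data only feature 0 survives: feature 1 fires on both corpora (a generic feature), and feature 2 fires equally rarely on both.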
If this is right
- The graphs recover coherent chapter- and subchapter-level structure in a biology textbook domain.
- They reveal concepts that bridge neighboring topics.
- Messy sentence-level activity containing thousands of features is transformed into compact, readable views of local model activity.
- This approach enables audits of reasoning faithfulness by providing a global map of model knowledge.
Where Pith is reading between the lines
- The mechanism graph could be used to trace specific feature pathways and explain how the model reaches particular outputs.
- Targeted changes to nodes or edges in these graphs might allow precise editing of the model's domain knowledge.
- The same filtering and graphing steps could apply to other domains to produce comparable internal maps for different models.
Load-bearing premise
The multi-stage contrastive filtering and the graph constructions accurately reflect the model's genuine domain concepts and relationships, rather than artifacts of the SAE or the filtering steps.
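One partial probe of this premise is to ask how much held-out domain activation the filtered feature set still accounts for. The sketch below computes that share as a fraction of total activation mass. The function name `activation_coverage` and the mass-based definition are assumptions for illustration; a high value would only show that little activation is discarded, not that the retained features are artifact-free.

```python
from typing import Iterable, Sequence


def activation_coverage(
    heldout_acts: Sequence[Sequence[float]],
    kept_features: Iterable[int],
) -> float:
    """Share of total held-out activation mass carried by the kept features."""
    kept = set(kept_features)
    total = 0.0
    covered = 0.0
    for row in heldout_acts:
        for feature, activation in enumerate(row):
            total += activation
            if feature in kept:
                covered += activation
    return covered / total if total > 0 else 0.0


# Toy held-out activations: rows are passages, columns are SAE features.
heldout = [[2.0, 1.0, 1.0], [3.0, 0.0, 1.0]]
coverage = activation_coverage(heldout, kept_features=[0])  # 5.0 of 8.0 total
```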
What would settle it
If manual review or comparison to the source textbook shows that the graphs do not recover the chapter structure or contain many incorrect connections between unrelated concepts, the claim that they form accurate knowledge graphs would be falsified.
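A comparison to the textbook can be made quantitative by scoring how well graph communities match chapter labels. The sketch below computes normalized mutual information (NMI) between two labelings of the same features. The arithmetic-mean normalization, the function names, and the toy labels are assumptions; the page does not specify which NMI variant or community-detection method would be used.

```python
import math
from collections import Counter
from typing import Hashable, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in nats) of a labeling."""
    n = len(labels)
    return -sum((c / n) * math.log(c / n) for c in Counter(labels).values())


def mutual_information(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """Mutual information (in nats) between two labelings of the same items."""
    n = len(a)
    count_a, count_b, joint = Counter(a), Counter(b), Counter(zip(a, b))
    return sum(
        (n_xy / n) * math.log(n_xy * n / (count_a[x] * count_b[y]))
        for (x, y), n_xy in joint.items()
    )


def nmi(a: Sequence[Hashable], b: Sequence[Hashable]) -> float:
    """NMI with arithmetic-mean normalization: 2 * I(A;B) / (H(A) + H(B))."""
    if len(a) != len(b):
        raise ValueError("labelings must cover the same items")
    h_a, h_b = entropy(a), entropy(b)
    if h_a == 0.0 and h_b == 0.0:
        return 1.0  # both labelings are a single group
    return 2.0 * mutual_information(a, b) / (h_a + h_b)


# Toy example: four features, graph communities vs. textbook chapters.
communities = [0, 0, 1, 1]
chapters = ["ch1", "ch1", "ch2", "ch2"]
score = nmi(communities, chapters)
```

A score near 1 means the communities reproduce the chapter partition; a score near 0 means they carry almost no information about it.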
Original abstract
Sparse autoencoders (SAEs) extract millions of interpretable features from a language model, but flat feature inventories aren't very useful on their own. Domain concepts get mixed with generic and weakly grounded features, while related ideas are scattered across many units, and there's no way to understand relationships between features. We address this by first constructing a strict domain-specific concept universe from a large SAE inventory using contrastive activations and a multi-stage filtering process. Next, we build two aligned graph views on the filtered set: a co-occurrence graph for corpus-level conceptual structure, organized at multiple levels of granularity, and a transcoder-based mechanism graph that links source-layer and target-layer features through sparse latent pathways. Automated edge labeling then turns these graph views into readable knowledge graphs rather than unlabeled layouts. In a case study on a biology textbook, these graphs recover coherent chapter and subchapter-level structure, reveal concepts that bridge neighboring topics, and transform messy sentence-level activity containing thousands of features into compact, readable views that illustrate the model's local activity. Taken together, this reframes a flat SAE inventory as an internal knowledge graph that converts feature-level interpretability into a global map of model knowledge and enables audits of reasoning faithfulness.
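To make the co-occurrence view concrete, the sketch below builds weighted edges between features that are active in the same text units. Jaccard similarity as the edge weight, the `min_jaccard` threshold, and the string feature names are assumptions for illustration; the abstract does not say how co-occurrence is scored or thresholded.

```python
from collections import Counter
from itertools import combinations
from typing import Dict, Hashable, Iterable, Tuple


def cooccurrence_edges(
    active_sets: Iterable[Iterable[Hashable]],
    min_jaccard: float = 0.5,  # hypothetical threshold
) -> Dict[Tuple[Hashable, Hashable], float]:
    """Map each feature pair to its Jaccard co-occurrence, if above threshold."""
    feature_counts: Counter = Counter()
    pair_counts: Counter = Counter()
    for active in active_sets:
        features = sorted(set(active))
        feature_counts.update(features)
        pair_counts.update(combinations(features, 2))

    edges = {}
    for (a, b), n_both in pair_counts.items():
        n_either = feature_counts[a] + feature_counts[b] - n_both
        jaccard = n_both / n_either
        if jaccard >= min_jaccard:
            edges[(a, b)] = jaccard
    return edges


# Toy input: each set holds the features active on one sentence.
sentences = [
    {"cell", "membrane"},
    {"cell", "membrane", "enzyme"},
    {"enzyme", "protein"},
    {"cell", "protein"},
]
edges = cooccurrence_edges(sentences)
```

Here only the pair ("cell", "membrane") clears the threshold, with weight 2/3. Running the same routine over sentences, sections, or chapters would be one simple way to get several levels of granularity.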
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a pipeline to convert flat inventories of sparse autoencoder (SAE) features into domain-filtered knowledge graphs. It first applies contrastive activations and a multi-stage filtering process to isolate a domain-specific concept universe, then constructs two aligned graph representations—a co-occurrence graph capturing corpus-level conceptual structure at multiple granularities and a transcoder-based mechanism graph linking source- and target-layer features via sparse pathways—followed by automated edge labeling to produce readable graphs. A qualitative case study on a biology textbook is used to illustrate recovery of chapter and subchapter structure, identification of bridging concepts, and compact views of local model activity.
Significance. If the filtering and graph constructions can be shown to faithfully extract model-internal concepts and relations without substantial distortion, the work would meaningfully advance SAE-based interpretability by moving from isolated features to structured, auditable maps of model knowledge. The qualitative biology case study provides suggestive evidence that the graphs can surface coherent domain organization and support reasoning audits, but the absence of quantitative validation or ablations limits the strength of this contribution.
major comments (3)
- [§3 (Domain Filtering Pipeline)] The central claim that the multi-stage filtering plus co-occurrence and transcoder graphs produce a faithful internal knowledge graph rests on the untested assumption that these steps accurately capture the model's domain concepts without artifacts or information loss. No ablations are reported that isolate the effect of the contrastive activation step or compare filtered versus unfiltered feature sets against held-out model activations.
- [§5 (Biology Textbook Case Study)] In the biology textbook case study, the graphs are reported to recover chapter-level structure and bridging concepts. However, no quantitative metrics (e.g., precision/recall against the textbook's table of contents, edge overlap with external ontologies, or correlation with held-out activation patterns) or control experiments (e.g., graphs from random SAE features or corpus-only baselines) are provided to rule out that the observed structure arises from labeling heuristics or corpus organization rather than model internals.
- [§4 (Graph Construction)] The transcoder-based mechanism graph is presented as linking features through sparse latent pathways, yet no evaluation is given of how well these edges correspond to actual cross-layer computations (e.g., via activation patching or correlation on held-out data) or of the alignment quality between the co-occurrence and mechanism views.
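The precision/recall comparison against the table of contents suggested in the second comment could be set up pairwise: two concepts count as a predicted pair if the graph puts them in the same community, and as a reference pair if the textbook puts them in the same chapter. The sketch below is one such formulation; the function names and toy assignments are hypothetical and not taken from the paper.

```python
from itertools import combinations
from typing import Dict, FrozenSet, Hashable, Set, Tuple


def same_group_pairs(assignment: Dict[str, Hashable]) -> Set[FrozenSet[str]]:
    """All unordered pairs of items that share a group label."""
    return {
        frozenset((a, b))
        for a, b in combinations(sorted(assignment), 2)
        if assignment[a] == assignment[b]
    }


def pairwise_precision_recall(
    predicted: Dict[str, Hashable],
    reference: Dict[str, Hashable],
) -> Tuple[float, float]:
    """Precision and recall of predicted same-group pairs vs. reference pairs."""
    shared = set(predicted) & set(reference)
    pred_pairs = same_group_pairs({k: predicted[k] for k in shared})
    ref_pairs = same_group_pairs({k: reference[k] for k in shared})
    true_positives = len(pred_pairs & ref_pairs)
    precision = true_positives / len(pred_pairs) if pred_pairs else 0.0
    recall = true_positives / len(ref_pairs) if ref_pairs else 0.0
    return precision, recall


# Toy example: graph communities vs. textbook chapters.
graph_communities = {"mitosis": 0, "meiosis": 0, "enzyme": 0, "substrate": 1}
textbook_chapters = {
    "mitosis": "cell division",
    "meiosis": "cell division",
    "enzyme": "metabolism",
    "substrate": "metabolism",
}
precision, recall = pairwise_precision_recall(graph_communities, textbook_chapters)
```

In the toy data, one of three predicted pairs is correct (precision 1/3) and one of two reference pairs is recovered (recall 1/2). Running the same score on graphs built from random feature subsets would give the kind of control baseline the comment asks for.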
minor comments (2)
- The abstract and introduction would benefit from an explicit statement of the evaluation criteria (qualitative only, or planned quantitative follow-ups) to set reader expectations.
- Figure captions for the graph visualizations should include legends clarifying node colors, edge types, and granularity levels to improve readability.
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback. The comments correctly note the primarily qualitative nature of the current evidence and the value of additional ablations and metrics. We have revised the manuscript to incorporate ablations on the filtering pipeline, quantitative metrics and controls for the case study, and further evaluation of the mechanism graph edges. Our point-by-point responses follow.
- Referee: [§3 (Domain Filtering Pipeline)] The central claim that the multi-stage filtering plus co-occurrence and transcoder graphs produce a faithful internal knowledge graph rests on the untested assumption that these steps accurately capture the model's domain concepts without artifacts or information loss. No ablations are reported that isolate the effect of the contrastive activation step or compare filtered versus unfiltered feature sets against held-out model activations.
  Authors: We agree that explicit ablations strengthen the validation of the filtering pipeline. In the revised manuscript we have added an ablation that isolates the contrastive activation step by comparing it against a frequency-threshold baseline. We also compare the filtered feature set to the full unfiltered SAE inventory by measuring activation coverage on held-out biology textbook passages. These additions demonstrate that contrastive filtering retains the large majority of domain-relevant activations while substantially reducing generic features, supporting that the pipeline introduces limited artifacts. The new results and discussion appear in an expanded §3.
  Revision: yes
- Referee: [§5 (Biology Textbook Case Study)] In the biology textbook case study, the graphs are reported to recover chapter-level structure and bridging concepts. However, no quantitative metrics (e.g., precision/recall against the textbook's table of contents, edge overlap with external ontologies, or correlation with held-out activation patterns) or control experiments (e.g., graphs from random SAE features or corpus-only baselines) are provided to rule out that the observed structure arises from labeling heuristics or corpus organization rather than model internals.
  Authors: The case study was designed as an illustrative demonstration of structure recovery and audit utility. We accept that quantitative support is needed. The revised version adds normalized mutual information between graph-derived communities and the textbook chapter hierarchy, plus a control experiment using random SAE feature subsets that yields substantially weaker structure recovery. Correlation with held-out activations is already embedded in the co-occurrence construction; we have clarified this and added a brief discussion of the practical limits of external ontology overlap. These changes are incorporated in §5.
  Revision: partial
- Referee: [§4 (Graph Construction)] The transcoder-based mechanism graph is presented as linking features through sparse latent pathways, yet no evaluation is given of how well these edges correspond to actual cross-layer computations (e.g., via activation patching or correlation on held-out data) or of the alignment quality between the co-occurrence and mechanism views.
  Authors: The mechanism graph edges are produced by the transcoder's learned sparse pathways, which are optimized to reconstruct cross-layer activations. In revision we have added a correlation analysis between proposed edges and feature co-activations on held-out data, showing statistically significant alignment. Alignment quality between the co-occurrence and mechanism views is now quantified via edge overlap and node-centrality correlation within the biology case study. Exhaustive activation patching for every edge remains computationally prohibitive at the current scale and is noted as future work; the added correlation results are presented in §4.
  Revision: partial
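The edge-overlap measure mentioned in the last response could be as simple as a Jaccard index over the two edge sets. The sketch below assumes both views use the same feature identifiers and ignores the direction of mechanism edges; both are simplifications made for this example, and the function names are hypothetical.

```python
from typing import FrozenSet, Hashable, Iterable, Set, Tuple

Edge = Tuple[Hashable, Hashable]


def undirected(edges: Iterable[Edge]) -> Set[FrozenSet[Hashable]]:
    """Drop edge direction and self-loops so the two views are comparable."""
    return {frozenset((u, v)) for u, v in edges if u != v}


def edge_overlap(cooc_edges: Iterable[Edge], mech_edges: Iterable[Edge]) -> float:
    """Jaccard index between the edge sets of the two graph views."""
    a, b = undirected(cooc_edges), undirected(mech_edges)
    union = a | b
    return len(a & b) / len(union) if union else 0.0


# Toy example: two of the four distinct edges appear in both views.
cooc = [("a", "b"), ("b", "c"), ("c", "d")]
mech = [("b", "a"), ("b", "c"), ("a", "d")]
overlap = edge_overlap(cooc, mech)
```

A low overlap is not necessarily a failure: concepts can co-occur across a corpus without one feeding the other across layers, so the two views may legitimately differ.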
Circularity Check
No circularity: constructive pipeline from activations and features
The paper describes a methodological pipeline that starts from an existing SAE feature inventory, applies contrastive activations for domain filtering, constructs co-occurrence and transcoder graphs, and performs automated labeling. These are forward constructive steps on external model outputs rather than any derivation that reduces to its own inputs by definition, fitted-parameter renaming, or self-citation load-bearing. No equations or claims in the abstract or described process exhibit self-definitional loops, uniqueness theorems imported from the authors, or ansatzes smuggled via prior work. The biology case study is presented as qualitative recovery, not a prediction forced by the construction itself. The derivation chain remains self-contained against the SAE activations and corpus data.
Axiom & Free-Parameter Ledger
axioms (2)
- Domain assumption: SAE features represent interpretable concepts that can be isolated into domain-specific sets via contrastive activations and multi-stage filtering.
- Domain assumption: Co-occurrence patterns and transcoder pathways between layers reflect meaningful conceptual and mechanistic relationships suitable for graph representation.