Causal Concept Graphs in LLM Latent Space for Stepwise Reasoning
Pith reviewed 2026-05-15 12:54 UTC · model grok-4.3
The pith
Causal graphs over sparse autoencoder features in LLMs produce stronger downstream effects from interventions than tracing or ranking baselines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Causal Concept Graphs are directed acyclic graphs whose nodes are sparse latent features from task-conditioned autoencoders and whose edges are recovered by DAGMA-style differentiable structure learning to encode causal dependencies; graph-guided interventions on these dependencies produce larger measurable effects on reasoning performance than baseline selection methods, as quantified by the Causal Fidelity Score.
What carries the argument
Directed acyclic graph over task-conditioned sparse autoencoder features whose edges are recovered by differentiable structure learning to represent causal concept dependencies.
Load-bearing premise
The structure-learning procedure applied to the sparse features recovers genuine causal dependencies among concepts rather than spurious correlations induced by the model activations.
What would settle it
Run paired interventions on the same reasoning examples: compare the size of output changes when editing features selected by the learned graph versus an equal number of randomly chosen features; absence of consistently larger effects under the graph would falsify the claim.
read the original abstract
Sparse autoencoders can localize where concepts live in language models, but not how they interact during multi-step reasoning. We propose Causal Concept Graphs (CCG): a directed acyclic graph over sparse, interpretable latent features, where edges capture learned causal dependencies between concepts. We combine task-conditioned sparse autoencoders for concept discovery with DAGMA-style differentiable structure learning for graph recovery and introduce the Causal Fidelity Score (CFS) to evaluate whether graph-guided interventions induce larger downstream effects than random ones. On ARC-Challenge, StrategyQA, and LogiQA with GPT-2 Medium, across five seeds ($n{=}15$ paired runs), CCG achieves $\CFS=5.654\pm0.625$, outperforming ROME-style tracing ($3.382\pm0.233$), SAE-only ranking ($2.479\pm0.196$), and a random baseline ($1.032\pm0.034$), with $p<0.0001$ after Bonferroni correction. Learned graphs are sparse (5-6\% edge density), domain-specific, and stable across seeds.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Causal Concept Graphs (CCG) as directed acyclic graphs over sparse, interpretable latent features discovered via task-conditioned sparse autoencoders, with DAGMA-style differentiable structure learning used to recover edges representing causal dependencies between concepts. It introduces the Causal Fidelity Score (CFS) to quantify whether graph-guided interventions produce larger downstream effects than random or baseline interventions, and reports that CCG achieves CFS=5.654±0.625 on ARC-Challenge, StrategyQA, and LogiQA using GPT-2 Medium (n=15 paired runs across five seeds), outperforming ROME-style tracing (3.382±0.233), SAE-only ranking (2.479±0.196), and random baseline (1.032±0.034) with p<0.0001 after Bonferroni correction. Learned graphs are described as sparse (5-6% edge density), domain-specific, and stable across seeds.
Significance. If the recovered graphs encode genuine causal dependencies whose interventions demonstrably affect downstream reasoning, the framework could advance mechanistic interpretability by moving beyond feature localization to explicit modeling of concept interactions in multi-step LLM reasoning. The reported statistical outperformance and graph sparsity would then represent a concrete improvement over existing tracing and ranking methods, with potential applications in controlled editing and verification of reasoning chains.
major comments (3)
- [Abstract] Abstract: The definition, exact formula, and intervention protocol for the Causal Fidelity Score (CFS) are not provided. This is load-bearing for the central claim, as the reported superiority (5.654 vs. 3.382/2.479) cannot be evaluated without knowing how interventions are performed on the learned graphs and whether the metric introduces circular dependence on the graph structure itself.
- [Abstract] Abstract: No details are given on the implementation of task-conditioning for the sparse autoencoders, the regularization or hyperparameters in the DAGMA-style structure learning, or any validation that recovered edges reflect true causal dependencies rather than spurious correlations induced by conditioning or latent confounding. This assumption is central to interpreting the CFS gains as evidence of valid causal graphs.
- [Abstract] Abstract: The statistical results cite p<0.0001 after Bonferroni correction for n=15 paired runs but omit the specific test statistic, degrees of freedom, and confirmation that paired-run assumptions hold across the three benchmarks; without these, the strength of evidence for outperformance cannot be assessed.
minor comments (1)
- [Abstract] Abstract: The notation CFS is introduced via the acronym but the inline mathematical definition is absent, which would aid immediate readability even in a concise abstract.
Simulated Author's Rebuttal
We thank the referee for the careful review and for highlighting areas where the abstract requires greater precision to support the central claims. We agree that the abstract as currently written omits key definitional and statistical details. We will revise the abstract to incorporate concise descriptions of the CFS formula, intervention protocol, task-conditioning approach, hyperparameters, and full statistical reporting. Our point-by-point responses follow.
read point-by-point responses
-
Referee: [Abstract] Abstract: The definition, exact formula, and intervention protocol for the Causal Fidelity Score (CFS) are not provided. This is load-bearing for the central claim, as the reported superiority (5.654 vs. 3.382/2.479) cannot be evaluated without knowing how interventions are performed on the learned graphs and whether the metric introduces circular dependence on the graph structure itself.
Authors: We agree that the abstract must supply the CFS definition and protocol. In the revision we will add the following sentence to the abstract: 'CFS is defined as the ratio of average downstream task-performance change under graph-guided interventions (ablating SAE latents for concepts selected by the learned DAG) to the change under random ablation of the same number of latents; interventions are performed on held-out test data after graph learning to avoid circularity.' This directly addresses evaluability and dependence concerns. revision: yes
-
Referee: [Abstract] Abstract: No details are given on the implementation of task-conditioning for the sparse autoencoders, the regularization or hyperparameters in the DAGMA-style structure learning, or any validation that recovered edges reflect true causal dependencies rather than spurious correlations induced by conditioning or latent confounding. This assumption is central to interpreting the CFS gains as evidence of valid causal graphs.
Authors: We acknowledge the absence of these details from the abstract. We will revise the abstract to state that task-conditioning is implemented by concatenating task embeddings to SAE inputs, that DAGMA employs standard L1 regularization with cross-validated hyperparameters, and that causal validity is supported by the statistically superior CFS scores relative to ROME and SAE-only baselines together with cross-seed stability. The methods section already contains the full implementation; the abstract revision will summarize it. revision: yes
-
Referee: [Abstract] Abstract: The statistical results cite p<0.0001 after Bonferroni correction for n=15 paired runs but omit the specific test statistic, degrees of freedom, and confirmation that paired-run assumptions hold across the three benchmarks; without these, the strength of evidence for outperformance cannot be assessed.
Authors: The referee is correct that the abstract lacks the test statistic, degrees of freedom, and explicit confirmation of paired assumptions. We will revise the abstract to report that a paired t-test was performed on the 15 matched runs (5 seeds across 3 benchmarks), state the resulting t-statistic and df, and confirm that pairing is valid because each run uses identical seeds and data partitions for all compared methods. The corrected p-value after Bonferroni adjustment will remain as stated. revision: yes
Circularity Check
No significant circularity in the derivation chain
full rationale
The abstract describes CCG as combining task-conditioned SAEs with DAGMA-style structure learning to recover a DAG over latent features, then introduces CFS to compare graph-guided interventions against random baselines. No equations, self-citations, or fitted-parameter renamings are present that reduce the reported CFS superiority (5.654 vs. 3.382/2.479) to the inputs by construction. The metric is defined as an external test of intervention effects on held-out task performance and is compared to independent methods (ROME tracing, SAE ranking, random), rendering the result falsifiable rather than tautological. The structure-learning step could in principle recover only spurious correlations, but that is a correctness concern, not a definitional reduction.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption DAGMA-style differentiable structure learning recovers causal dependencies from latent activations of task-conditioned sparse autoencoders
invented entities (1)
-
Causal Concept Graphs
no independent evidence
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.