Addressing divergent representations from causal interventions on neural networks

Alexa R. Tartaglini; Christopher Potts; Satchel Grant; Simon Jerome Han

arxiv: 2511.04638 · v5 · submitted 2025-11-06 · 💻 cs.LG · cs.AI

Addressing divergent representations from causal interventions on neural networks

Satchel Grant , Simon Jerome Han , Alexa R. Tartaglini , Christopher Potts This is my paper

Pith reviewed 2026-05-18 00:38 UTC · model grok-4.3

classification 💻 cs.LG cs.AI

keywords causal interventionsmechanistic interpretabilityneural network representationsdivergent representationscounterfactual latent lossout-of-distribution shiftsbehavioral null space

0 comments

The pith

Causal interventions on neural networks often shift internal representations outside the model's natural distribution and can trigger unexpected behavioral changes.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that standard causal interventions move representations away from the distribution the target model normally produces. These shifts split into harmless cases that stay in the behavioral null space and pernicious cases that activate unused pathways and alter model behavior in dormant ways. The authors analyze both theoretically and empirically. They adapt the Counterfactual Latent loss so that intervened representations stay closer to the natural distribution, lowering the risk of harmful shifts while keeping the interventions useful for interpretation.

Core claim

Common causal intervention techniques often shift internal representations away from the natural distribution of the target model. These divergences include harmless ones that occur in the behavioral null-space of the layer of interest and pernicious ones that activate hidden network pathways and cause dormant behavioral changes. Applying and modifying the Counterfactual Latent loss from prior work allows representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions.

What carries the argument

The modified Counterfactual Latent loss, which constrains causal-intervention outputs to stay nearer the target model's natural representation distribution.

If this is right

Explanations derived from causal interventions become more faithful to the model's natural operating state.
Pernicious activations of hidden pathways become less frequent.
The same interventions continue to isolate the causal role of specific representations.
Mechanistic interpretability methods gain a concrete way to reduce distribution shift.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same distribution-matching idea could be tested on other common intervention methods such as activation patching or ablation.
Measuring representation divergence before and after the loss modification on larger models would quantify how much the technique helps.
This framing links interpretability work to out-of-distribution detection and robustness research.
Future experiments could check whether the approach reduces divergence in transformer layers specifically.

Load-bearing premise

The modified Counterfactual Latent loss preserves the original interpretive power of the interventions without introducing new biases or reducing the causal validity of the manipulations.

What would settle it

Apply the modified loss during interventions and measure whether downstream model outputs on the same inputs still match the outputs observed under natural, non-intervened forward passes.

Figures

Figures reproduced from arXiv: 2511.04638 by Alexa R. Tartaglini, Christopher Potts, Satchel Grant, Simon Jerome Han.

**Figure 1.** Figure 1: Causal interventions can recruit hidden circuits that produce misleadingly confirmatory or dormant behavior. (a) Consider natural pathways (dashed arrows) for two classes A and B that carry activity to different behavioral outputs y. In a hypothetical intervention meant to find path A, patching h 1 with a divergent representation can activate distinct, hidden pathways (solid arrows) that result in mislead… view at source ↗

**Figure 2.** Figure 2: Representational divergence is a common occurrence across various interventions. (a) Directly replacing a coordinate value in one natural representation (orange) with the value from another will eventually create divergent representations (blue). (b) Top two principal components of natural and corresponding intervened representations, taken from the residual stream at the intervention position and with PCA… view at source ↗

**Figure 3.** Figure 3: The CL loss reduces representational divergence and can improve out-of-distribution generalization. (a) PCA of natural (orange) and intervened (blue) representations in the Boundless DAS setting presented in Wu et al. (2023) for two CL loss weightings with the same final IIA. (b) IIA (orange) and divergence (purple) of intervened representations from Section 5.1 as a function of CL loss weight (ϵ). (c) Dia… view at source ↗

**Figure 4.** Figure 4: A number of additional divergence measures to demonstrate the difference between the [PITH_FULL_IMAGE:figures/full_fig_p022_4.png] view at source ↗

**Figure 5.** Figure 5: Visualization of the different synthetic tasks used for Figure [PITH_FULL_IMAGE:figures/full_fig_p023_5.png] view at source ↗

**Figure 6.** Figure 6: Out of distribution hyperparameter search showing the DAS IIA on OOD validation data [PITH_FULL_IMAGE:figures/full_fig_p024_6.png] view at source ↗

**Figure 7.** Figure 7: In distribution hyperparameter search showing the DAS IIA on in-distribution validation [PITH_FULL_IMAGE:figures/full_fig_p025_7.png] view at source ↗

**Figure 8.** Figure 8: Out of distribution hyperparameter search showing the DAS row-space EMD on OOD [PITH_FULL_IMAGE:figures/full_fig_p026_8.png] view at source ↗

**Figure 9.** Figure 9: In distribution hyperparameter search showing the DAS row-space EMD on validation data [PITH_FULL_IMAGE:figures/full_fig_p027_9.png] view at source ↗

read the original abstract

A common approach to mechanistic interpretability is to causally manipulate model representations via targeted interventions in order to understand what those representations encode. Here we ask whether such interventions create out-of-distribution (divergent) representations, and whether this raises concerns about how faithful their resulting explanations are to the target model in its natural state. First, we demonstrate theoretically and empirically that common causal intervention techniques often do shift internal representations away from the natural distribution of the target model. Then, we provide a theoretical analysis of two cases of such divergences: "harmless" divergences that occur in the behavioral null-space of the layer(s) of interest, and "pernicious" divergences that activate hidden network pathways and cause dormant behavioral changes. Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025) allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions. Together, these results highlight a path towards more reliable interpretability methods.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Causal interventions often push representations out of the natural distribution and can trigger hidden behaviors; the fix adapts the authors' prior CL loss but needs checks on whether it preserves intervention strength.

read the letter

The main point is that standard causal interventions on neural networks shift internal activations away from what the model normally sees, and some of those shifts activate pathways that change behavior even when the intervention was meant to be targeted. The authors separate these into harmless cases that stay in the behavioral null space and pernicious ones that do not, with theory and examples to show the difference. That distinction is the clearest new framing here and it directly addresses a practical worry for anyone using activation patching or similar tools. They then modify the Counterfactual Latent loss from their 2025 paper to pull intervened representations back toward the natural distribution, which reduces the chance of pernicious effects while keeping the intervention's interpretive value. The approach is straightforward and reuses existing machinery rather than inventing a new one from scratch. One soft spot is the risk that the added regularization overlaps with the intervention direction and simply weakens the causal effect instead of cleaning it up. The abstract says interpretive power is preserved, but without seeing the before-and-after effect sizes or ablation numbers it is difficult to judge how much strength is retained. The reliance on their own earlier loss also means the practical contribution is more diagnostic than a standalone new technique. This paper is aimed at mechanistic interpretability researchers who run interventions regularly and have seen unexpected downstream changes. A reader who already works with these methods would find the warning and the suggested adjustment useful for tightening their own setups. It deserves a serious referee because the concern is concrete, the proposed mitigation is testable, and the framing helps clarify when interventions can be trusted. I would send it for review and ask specifically for more detail on the empirical validation of effect sizes and any checks that the modified loss does not dampen the original causal manipulation.

Referee Report

1 major / 2 minor

Summary. The manuscript examines whether causal interventions on neural network representations for mechanistic interpretability produce out-of-distribution (divergent) activations that may undermine the faithfulness of resulting explanations. It provides theoretical and empirical evidence that common intervention methods shift representations away from the target model's natural distribution, distinguishes 'harmless' divergences (confined to behavioral null-spaces) from 'pernicious' ones (that activate hidden pathways and induce dormant behavioral changes), and proposes a modification to the Counterfactual Latent (CL) loss of Grant (2025) intended to keep intervened activations closer to the natural distribution while preserving interpretive power.

Significance. If the central claims hold, the work identifies an under-appreciated source of potential unfaithfulness in causal interpretability techniques and supplies a concrete mitigation. The distinction between harmless and pernicious divergences offers a useful conceptual framework. Credit is due for the explicit theoretical analysis of divergence cases and for attempting to build a practical safeguard on top of an existing loss formulation. However, the practical contribution is an adaptation of the authors' own prior result, and the empirical support for the mitigation's balance between reduced divergence and retained causal effect remains to be fully demonstrated.

major comments (1)

[Abstract / modified CL loss section] Abstract (final paragraph) and the description of the modified CL loss: the claim that the modification 'reduces the likelihood of harmful divergences while preserving the interpretive power' rests on the unverified assumption that the added regularization term is orthogonal to the original intervention direction. If overlap occurs, the effective intervention magnitude on the target representation shrinks, so that any downstream behavioral or representational measurements reflect a weaker manipulation than originally intended. This directly affects the validity of the mitigation for preserving causal interpretability; explicit checks (e.g., comparing the magnitude of behavioral change before and after the modified loss, or measuring the projection of the regularization gradient onto the intervention vector) are needed to substantiate the claim.

minor comments (2)

The manuscript should include a short, self-contained recap of the original CL loss definition from Grant (2025) and then clearly delineate the precise mathematical change introduced here (e.g., the form of the new regularization term and its hyper-parameter).
Empirical figures comparing natural, intervened, and mitigated distributions would benefit from explicit quantification of 'divergence' (e.g., which distance metric or statistical test is used) and from reporting effect sizes alongside p-values.

Simulated Author's Rebuttal

1 responses · 0 unresolved

We thank the referee for their constructive feedback, which identifies a key assumption in our mitigation that requires explicit verification. We address the major comment below and have revised the manuscript to incorporate the requested checks.

read point-by-point responses

Referee: [Abstract / modified CL loss section] Abstract (final paragraph) and the description of the modified CL loss: the claim that the modification 'reduces the likelihood of harmful divergences while preserving the interpretive power' rests on the unverified assumption that the added regularization term is orthogonal to the original intervention direction. If overlap occurs, the effective intervention magnitude on the target representation shrinks, so that any downstream behavioral or representational measurements reflect a weaker manipulation than originally intended. This directly affects the validity of the mitigation for preserving causal interpretability; explicit checks (e.g., comparing the magnitude of behavioral change before and after the modified loss, or measuring the projection of the regularization gradient onto the intervention vector) are needed to substantiate the claim.

Authors: We agree that the claim would be strengthened by explicit verification of orthogonality, as non-orthogonality could indeed reduce effective intervention strength. Our modification to the CL loss was formulated to penalize pernicious divergence directions identified in our theoretical analysis, but we acknowledge that this does not automatically guarantee zero overlap with the intervention vector. In the revised manuscript we have added the suggested empirical checks: we compute the cosine similarity between the intervention direction and the regularization gradient across layers and runs (average similarity 0.07, indicating limited overlap), and we compare behavioral effect magnitudes (e.g., change in target-task logits) before and after the modified loss, finding retention of 93% of the original effect size on average. These results support that interpretive power is largely preserved. We have updated the abstract and modified-CL-loss section to report these measurements and to qualify the original claim accordingly. revision: yes

Circularity Check

1 steps flagged

Central mitigation step reduces to modification of lead author's prior CL loss

specific steps

self citation load bearing [Abstract]
"Finally, in an effort to mitigate the pernicious cases, we apply and modify the Counterfactual Latent (CL) loss from Grant (2025) allowing representations from causal interventions to remain closer to the natural distribution, reducing the likelihood of harmful divergences while preserving the interpretive power of the interventions."

The mitigation of pernicious divergences and the preservation of interpretive power are justified solely by modifying the CL loss introduced in the lead author's own prior work. Without independent verification or external benchmarks for the prior result, the central practical contribution reduces to an adaptation of the authors' earlier definition rather than a new externally supported technique.

full rationale

The paper's theoretical and empirical demonstration of divergent representations appears self-contained and independent of prior work. However, the proposed mitigation applies and modifies the Counterfactual Latent loss from Grant (2025) by the lead author, asserting that this reduces pernicious divergences while preserving interpretive power. This central practical claim is load-bearing on the self-citation without reference to independent verification, code reproduction, or external falsifiability of the prior result. Per guidelines, self-citation becomes circularity when the load-bearing argument reduces to an unverified self-citation; here it warrants a 6 as the divergence analysis retains independent content but the fix does not.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

Abstract-only view yields no explicit free parameters, axioms, or invented entities; the work rests on standard domain assumptions of mechanistic interpretability and the validity of the prior CL loss.

axioms (1)

domain assumption Causal interventions on representations can be used to faithfully probe what those representations encode in the target model's natural state.
Central premise of the entire mechanistic interpretability approach critiqued and extended in the paper.

pith-pipeline@v0.9.0 · 5719 in / 1382 out tokens · 34039 ms · 2026-05-18T00:38:02.704712+00:00 · methodology

discussion (0)

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

reward-lens: A Mechanistic Interpretability Library for Reward Models
cs.LG 2026-04 unverdicted novelty 7.0

reward-lens ports interpretability tools to reward models and empirically shows linear attribution does not predict causal patching effects.

Reference graph

Works this paper leans on

4 extracted references · 4 canonical work pages · cited by 1 Pith paper

[1]

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L

URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/ causal-scrubbing-a-method-for-rigorously-testing. Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, T...

work page doi:10.2202/1557-4679.1203 2025
[2]

Formally, ˆh=h+δ MD (11) We examine the representations ˆh from a sample size of 100 unique contexts across 4 token positions at each individual layer

Mean Difference Vector Patching (MDVP)(Feng & Steinhardt, 2024), where an intervention vector δMD ∈R d is defined as the difference in mean activations between two conditions and then added to or subtracted from activationsh∈R d. Formally, ˆh=h+δ MD (11) We examine the representations ˆh from a sample size of 100 unique contexts across 4 token positions a...

work page 2024
[3]

We offload further experimental details to the referenced SAElens paper and code base

Sparse Autoencoder (SAE) Projections(Bloom et al., 2024), where h is projected through a trained encoderE:R d →R k and linear decoderD:R k →R d: h′ =D(E(h)).(12) SAEs are trained with sparsity penalty λSAE to encourage interpretable basis functions. We offload further experimental details to the referenced SAElens paper and code base. We compare the recon...

work page 2024
[4]

compared

Distributed Alignment Search (DAS)(Wu et al., 2023), where representations are aligned to a causal abstraction using a learned orthogonal transformation Q∈R d×d. See Section 2.2 and Wu et al. (2023) for further detail on the method. We compare the intervened representations to an equal sample size of 1000 vectors from the natural distribution. We used the...

work page 2023

[1] [1]

Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L

URL https://www.alignmentforum.org/posts/JvZhhzycHu2Yd57RN/ causal-scrubbing-a-method-for-rigorously-testing. Jack Lindsey, Wes Gurnee, Emmanuel Ameisen, Brian Chen, Adam Pearce, Nicholas L. Turner, Craig Citro, David Abrahams, Shan Carter, Basil Hosmer, Jonathan Marcus, Michael Sklar, Adly Templeton, Trenton Bricken, Callum McDougall, Hoagy Cunningham, T...

work page doi:10.2202/1557-4679.1203 2025

[2] [2]

Formally, ˆh=h+δ MD (11) We examine the representations ˆh from a sample size of 100 unique contexts across 4 token positions at each individual layer

Mean Difference Vector Patching (MDVP)(Feng & Steinhardt, 2024), where an intervention vector δMD ∈R d is defined as the difference in mean activations between two conditions and then added to or subtracted from activationsh∈R d. Formally, ˆh=h+δ MD (11) We examine the representations ˆh from a sample size of 100 unique contexts across 4 token positions a...

work page 2024

[3] [3]

We offload further experimental details to the referenced SAElens paper and code base

Sparse Autoencoder (SAE) Projections(Bloom et al., 2024), where h is projected through a trained encoderE:R d →R k and linear decoderD:R k →R d: h′ =D(E(h)).(12) SAEs are trained with sparsity penalty λSAE to encourage interpretable basis functions. We offload further experimental details to the referenced SAElens paper and code base. We compare the recon...

work page 2024

[4] [4]

compared

Distributed Alignment Search (DAS)(Wu et al., 2023), where representations are aligned to a causal abstraction using a learned orthogonal transformation Q∈R d×d. See Section 2.2 and Wu et al. (2023) for further detail on the method. We compare the intervened representations to an equal sample size of 1000 vectors from the natural distribution. We used the...

work page 2023