
arxiv: 2605.29126 · v1 · pith:O4WMX5QInew · submitted 2026-05-27 · 💻 cs.LG · cs.AI

When and How Long? The Readout-Mediator Angle in Temporal Reasoning

Pith reviewed 2026-06-29 13:28 UTC · model grok-4.3

keywords: linear probes, distributed alignment search, temporal reasoning, causal mediation, language models, representation geometry, orthogonality, calendar date reasoning

The pith

Linear probes decode day-of-year accurately yet remain orthogonal to the causal subspace a model uses for date-to-duration reasoning.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

In language models performing calendar-date duration reasoning, a sin/cos linear probe recovers day-of-year from layer activations with near-perfect accuracy. Ablating the probe direction leaves model outputs unchanged, while ablating a four-dimensional subspace identified by distributed alignment search at the same layer collapses performance. The angle between these two subspaces matches the distribution expected for unrelated random directions under the Haar measure. Reverse-engineering the circuit shows attention heads route month context via learned QK offsets at specific day intervals and MLPs convert absolute dates into durations downstream of the DAS subspace. The dissociation between probe and mediator holds across four model scales and two families, with early signs in spatial and arithmetic domains.

Core claim

The paper's core claim is that a linear probe can decode a representation almost perfectly yet lie orthogonal to the model's actual computation. On calendar-date duration reasoning, the sin/cos probe recovers day-of-year from a layer's activations, but ablating its direction has no effect, while ablating the four-dimensional DAS subspace at the same layer collapses performance entirely. The readout-mediator angle is indistinguishable from the Haar-uniform null. Attention heads implement month-grained routing through QK offsets at ±30 and ±61 days, MLPs perform the when-to-how-long conversion, and sparse autoencoders confirm that the probe-aligned and DAS-aligned features encode disjoint concepts.

What carries the argument

The readout-mediator angle between a linear probe's decoding direction and the causal mediator subspace identified by distributed alignment search.
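
As a rough illustration of how such an angle can be computed, the sketch below returns the principal angles between the spans of two bases, using the singular values of the product of their orthonormalized bases. The function name and the toy inputs are ours, not the paper's, and the paper's exact estimator (for example, raw versus whitened coordinates) may differ.

```python
import numpy as np

def principal_angles_deg(A, B):
    """Principal angles (degrees) between span(A) and span(B).

    A and B have shape (d, k_a) and (d, k_b); columns need not be orthonormal.
    """
    Qa, _ = np.linalg.qr(A)          # orthonormal basis for span(A)
    Qb, _ = np.linalg.qr(B)          # orthonormal basis for span(B)
    cosines = np.linalg.svd(Qa.T @ Qb, compute_uv=False)
    return np.degrees(np.arccos(np.clip(cosines, -1.0, 1.0)))

# Toy check in R^6 (hypothetical inputs): coordinate planes that share no axis.
e = np.eye(6)
angles_orthogonal = principal_angles_deg(e[:, :2], e[:, 2:4])
angles_identical = principal_angles_deg(e[:, :2], e[:, :2])
```

Averaging the returned angles gives the mean principal angle that the figure captions refer to.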

If this is right

  • Attention heads route month-grained context through learned QK offsets at ±30 and ±61 days before MLPs convert absolute date to duration downstream of the causal subspace.
  • Sparse-autoencoder features aligned to the probe versus the DAS subspace encode semantically disjoint concepts with negligible causal overlap.
  • The orthogonality between readout and mediator replicates across model scales from 1.5B to 9B parameters and two families.
  • The same dissociation appears in preliminary tests on spatial displacement and symbolic arithmetic.
  • Proposals to deploy probes as runtime safety monitors are undermined because high probe accuracy can coexist with zero causal relevance.
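
The last point, high decodability alongside zero causal relevance, can be illustrated with a toy linear model. This is a minimal sketch under our own assumptions: the "model" is a single linear readout whose weights live inside a rank-4 stand-in mediator subspace, and the "probe" is one direction orthogonal to it. None of the names or sizes come from the paper.

```python
import numpy as np

def ablate(x, U):
    """Project span(U) out of each row of x. U has orthonormal columns, shape (d, k)."""
    return x - (x @ U) @ U.T

rng = np.random.default_rng(0)
d = 64                                    # toy hidden size (hypothetical)
Q, _ = np.linalg.qr(rng.standard_normal((d, 5)))
U_M = Q[:, :4]                            # stand-in mediator subspace, rank 4
u_P = Q[:, 4:5]                           # stand-in probe direction, orthogonal to U_M

w = U_M @ rng.standard_normal(4)          # toy output weights lie inside span(U_M)

def toy_output(x):
    return x @ w

x = rng.standard_normal((100, d))
baseline = toy_output(x)

# Ablating the probe direction leaves the toy output untouched.
change_after_probe = np.abs(toy_output(ablate(x, u_P)) - baseline).max()
# Ablating the mediator subspace zeroes the toy output.
output_after_mediator = np.abs(toy_output(ablate(x, U_M))).max()
# The probe's readout is still fully available after the mediator is removed.
readout_survives = np.allclose(ablate(x, U_M) @ u_P, x @ u_P)
```

In this toy setting a monitor reading `x @ u_P` would see no change while the output is destroyed, which is the failure mode the bullet describes.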

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • If readout-mediator orthogonality occurs across many tasks, then interpretability conclusions drawn from high-accuracy probes may systematically miss the computations that actually drive outputs.
  • Causally constrained probe training could be tested as a way to reduce the angle between decoded directions and true mediators.
  • This geometry may explain why some probe-based interventions fail to affect model behavior even when accuracy is high.
  • The finding raises the possibility that similar orthogonalities exist in other representation geometries where linear probes appear successful.

Load-bearing premise

That the four-dimensional subspace found by DAS is the true causal mediator for the when-to-how-long conversion rather than a correlated but non-causal direction.

What would settle it

Measuring whether ablating the DAS subspace still collapses performance after the model is fine-tuned to use an alternative date-to-duration strategy, or observing a readout-mediator angle significantly below the Haar null in additional domains.

Figures

Figures reproduced from arXiv: 2605.29126 by Felix Wyss, Praitayini Kanakaraj, Shreyas Fadnavis.

Figure 1: The readout-mediator dissociation on a duration query (schematic). Given “How many days between March 15th and June 22nd?”, the model deploys two functionally orthogonal subspaces. Spy P: the probe subspace U_P passively decodes both dates; ablating it changes accuracy by −0.6 pp. Spy M: the DAS mediator U_M counts 99 days via month-boundary hops; ablating it collapses accuracy to 0%. The two subspaces are n…
Figure 2: The readout-mediator dissociation, quantified. (A) Accuracy on GEMMA 2 2B under four ablations at L⋆=1: DAS collapses accuracy to 0%; probe and random ablations produce drops within 1 pp of clean. (B) Mean principal angle between DAS and probe subspaces at each k; shaded band is the Haar-random null θ = arccos √(k/d). (C) Causal specificity ratio ρ = ∆_DAS/∆_random at k=2, 4, 6; DAS is 190–1050× more damaging …
Figure 3: A distributed, offset-structured, cross-family circuit. (A) Per-head QK-twist |z| on GEMMA 2 2B (26 layers × 8 heads); 24 BH-significant boundary heads at |z|≥3. (B) Detected QK offsets: Gemma BH-significant heads and QWEN 2.5 1.5B top-20. Both families cluster at ±30 and ±61 days; neither at ±7. (C) Dose–response: accuracy drop vs. fraction of boundary heads ablated; super-linear scaling in both families. …
Figure 4: DAS energy, MLP contribution, and TFA subspace geometry. (A) DAS subspace energy fraction through all 26 residual-stream layers: peaks at 26.4× Haar null at L⋆=1 and exceeds null at every layer (range 2.1–26.4×). Probe subspace energy tracks the random null throughout (0.0–4.1×). (B) Per-layer MLP contribution to DAS vs. probe subspace (L18–L25). Probe contribution peaks at L19 (4.3× null, calendar date) …
Figure 5: Circuit wiring and feature-level dissociation: probe reads when; DAS computes how long. (A) Circuit graph: L⋆=1 DAS mediator (26.4× null) → boundary heads L11H4/L12H6 (∆NLL=0.45) → MLP L18–25 (when → how long) → relay hub L24H2 (#2309, duration vocabulary) → output. Probe (dashed) has attribution = 0.000. At L⋆, DAS features encode copula syntax; temporal semantics emerge at the L=24 relay hub. (B) SA…
Figure 6: Adversarial dissociation. Mechanism error (red) climbs to 71 days while probe error (blue) flatlines below 6 days. At α=3, 93% of damage is undetected. The blind spot is generic: all seven probe targets tested (day-of-year, month, season, day-of-week, quarter, solstice, gradient) land within 2.8° of the Haar-random null. This is a geometric consequence of k ≪ d: in 2304 dimensions, a rank-4 probe subspac…
Figure 7: Emergence in PYTHIA 1.4B. Three diagnostics on a shared log-training-step x-axis; gold band spans the geometric emergence window [10^3, 5×10^4]. (A) Probe R² is already > 0.95 at step 0 and moves negligibly, so it is uninformative of mechanism learning. (B) Boundary-head count collapses from 133 spurious ridges to ∼50 task-tuned heads by step 1k, then stabilizes at 62. (C) FFT circularness at L⋆ grows 37× within t…
Figure 8: The readout-to-mediator spectrum. (A) Accuracy under four ablation conditions at L⋆=1 on GEMMA 2 2B: clean baseline, linear probe, top-50 GemmaScope SAE features, and the causal DAS subspace. (B) The same tools placed on a readout→mediator ruler, positioned by specificity ratio ρ_k. Probe (ρ=1.0, noise), SAE-50 (ρ=288, partial causal readout), DAS (ρ=1050, full mediator). Tool Ablation drop ρ_k vs. null d/k…
Figure 9: Calibration of δ(x) for flagging wrong clinical answers (GEMMA 2 2B, k=12, n=75). (a) Reliability diagram (5 equal-count δ bins). (b) ROC with Youden-J optimal threshold marked. (c) Precision-recall.
Figure 10: Manifold deviation δ(x) predicts clinical duration error (open 979-query three-tier benchmark). (A) Per-query absolute duration error (symlog, days) versus δ(x) at k⋆=12; correct (±20%, green) and wrong (red) cases (n=979), with marginal δ-density strip and Pearson r (bootstrapped 95% CI, permutation p) inset. (B) Pearson r as a function of subspace dimension k (stable for k∈[6, 12]; gold ring marks k⋆ …
Figure 11: Manifold deviation on the 979-query three-tier clinical benchmark: detailed view. (A) ROC per tier for δ(x) flagging wrong answers. (B) Pooled reliability: fraction-wrong in 5 equal-count δ-bins. (C) Per-tier Pearson r(δ, |err|) by duration magnitude. trivia (n=20), and short naturalistic completions (n=20), running each under (A) clean, (B) DAS-k=4 ablation, and (C) each of 25 random-k=4 ablations. For e…
Figure 12: Empirical Σ cos² θ_i in raw (a) and whitened (b) coordinates. Gray points and ±2σ bars: 10^4-draw Haar random k-subspace null; red squares: observed probe vs DAS; blue triangles: analytic k²/d null. S25 Supplement: angle indistinguishability tests and specificity CIs. Hypothesis tests. For each of the four scales, we Monte-Carlo the null distribution of Σ cos² θ_i between a Haar-random k=4 subspace and the t…
Figure 13: Probe–DAS mean principal angle across all 26 layers of …
Figure 14: Effective mediator dimension on QWEN 2.5 1.5B. (a) Ablation drop saturates at k=4; higher k admits multiple spanning bases and ablation is noisier to optimize. (b) Mean cos θ_i between bases at different k: no nested-subspace structure; larger-k DAS does not strictly extend smaller-k. Interpretation. Combined with the Grassmannian-scatter result at k=6 on GEMMA 2 2B (§4), the effective mediator dimension i…
Figure 15: TFA components on the Grassmannian. (A) MDS embedding of subspaces on Gr(4, 2304) using mean principal angle as the distance metric. Gray dots: 100 Haar-random k-frames (convex hull shaded). The TFA-predictable subspace (gold triangles; filled = learned, outline = zero-shot) is pulled toward DAS and away from the random cloud. The probe (blue square) sits in the random cloud at 88.5°. (B) Grassmannian d…
Figure 16: Extended alignment of TFA predictable and novel components with the DAS mediator (left, …
Figure 17: Feature-level detectors and causal specificity. (A) Month-tiling: 11 of 12 months have dedicated SAE features at Layer 3, each firing selectively on its month token. (B) Causal specificity ratio ρ: DAS ablation drops accuracy by 42 pp (ρ=1050×); probe ablation has ρ=0.8 (INERT); gradient probe achieves ρ=19.1; PCA reaches ρ=15. (C) TFA novel-computation projection: the predictable component aligns 7× Haar…
Figure 18: Temporal dynamics of mediator energy (e_med, gold bars) across token positions for three Set-F duration prompts. The k/d random baseline is shown as a dashed line. Left: per-token mediator energy with token labels. Right: pairwise cosine similarity of DAS-projected activations, showing block structure around date-bearing positions.
Figure 19: Monte Carlo null distribution of offset-mode coincidences ( …
Figure 20: DAS mediator energy through 26 layers. DAS subspace energy (blue) vs. probe subspace energy (orange) vs. Haar random null (dashed). Boundary-head layers shaded. DAS energy exceeds random null at every layer; probe energy does not. S44 Supplement: per-direction DAS ablation. To test whether the rank-4 mediator subspace operates as a cooperative unit or decomposes into independent directions, we ablate each …
Figure 21: Per-direction DAS ablation. (A) Individual direction ∆NLL (dashed: full k=4 ablation). (B) Pairwise observed vs. expected (sum of singles). Super-additive pairs lie above the diagonal. S45 Supplement: MLP vs. attention component attribution. Zero-ablation of attention output vs. MLP output at layers 18–25 reveals that MLP sub-layers carry the dominant causal signal for duration computation ( …
Figure 22: MLP vs. attention attribution at layers 18–25. MLP ablation consistently increases NLL; attention ablation is near-zero or negative at late layers.
Figure 23: Cascading ablation. (A) ∆NLL by head group. (B) Observed combined drop vs. sum of group drops (weakly super-additive). S47 Supplement: MLP SAE computation analysis. We use GemmaScope MLP SAEs (gemma-scope-2b-pt-mlp-canonical, 16K features per layer, W_dec ∈ R^{16384×2304}) as a transcoder substitute to decompose what each MLP writes to the residual stream into interpretable features. Unlike residual-stream SA…
Figure 24: GemmaScope transcoder features along the MLP pipeline. (a) Feature #14897 at L20 marks the read→write transition where probe alignment peaks and DAS alignment begins rising. (b) Feature #7290 at L25 has 15.9× DAS Haar ratio and promotes copula tokens (is, was) in logit-lens projection, confirming transcoders independently identify the same syntactic backbone as MLP SAEs. Group decoder steering. DAS-aligne…
Figure 25: Error-node analysis. (A) Reconstruction gap G(k) vs. top-k SAE features per DAS direction. G=1.00 at every depth: SAE-reconstructed directions capture zero causal effect via steering. (B) ∆NLL comparison at the best-k SAE reconstruction. DAS full ablation (+69.1) dwarfs all steering conditions. …
Figure 26: Transcoder vs MLP SAE. (A) DAS subspace span coverage (top-50 features) across MLP layers: both dictionaries achieve 5–9%. (B) L1 residual-stream comparison. (C) Steering comparison at L20: both TC and SAE features produce ∆NLL indistinguishable from random controls. …
Figure 27: Group decoder steering. (A) ∆NLL by group size k: DAS-aligned and random feature groups are indistinguishable at every k. (B) Super-additivity ratio SA(k): no super-additive cooperation emerges via decoder-direction steering.
Figure 28: Frozen-attention steering. (A) ∆NLL across seven conditions; all values |∆NLL|<0.1, three orders of magnitude below DAS ablation. (B) Pathway decomposition pie chart (uninterpretable due to noise-floor base effect). S50 Supplement: GemmaScope SAE feature dissociation. We use GemmaScope 16K residual-stream SAEs [Lieberum et al., 2024] to ground the geometric probe–DAS dissociation at the feature-dictionary …
Figure 29: NeuronPedia feature audit. Probe-aligned feature #12499 (a) activates on calendar months and promotes month/Month in its logits. DAS-aligned features encode syntactic structure: #14703 (b) at L⋆=1 activates on copula “is”, #2309 (c) at relay hub L=24 promotes duration vocabulary, and relay midpoint #15596 (d) at L=12 carries the copula chain. The probe and DAS directions decompose into semantically disjo…
Figure 30: NeuronPedia steering demonstrations. (a) Amplifying probe feature #12499 at high strength floods the output with calendar-month vocabulary: pure month fixation. (b) Amplifying DAS duration feature #2309 at L=24 produces obsessive duration-unit enumeration (days, weeks, months, seconds). (c) Strongly suppressing copula feature #14703 causes near-total generation collapse, confirming the syntactic backbone i…
Figure 31: Temporal-specialist SAE features at L=12. (a) Feature #3087 from the Lubana et al. temporal SAE (canrager/temporalSAEs, 9,216 features) is DAS-aligned and matches GemmaScope relay feature #15596 (cosine=0.59), recovering the copula relay in an independently trained dictionary. (b) Feature #8935 is probe-aligned and activates on date-entry contexts, encoding calendar position rather than duration: the probe…
Figure 32: OV circuit decomposition. (A) Heatmap of W_OV DAS-alignment per head; boundary heads (gold squares) show only a modest 1.17× enrichment. (B) Violin comparison; the effect is statistically significant (p=0.004) but substantively small, confirming the circuit is QK-mediated rather than OV-mediated. S52 Supplement: attribution flow graph. To visualize the full circuit structure, we build a directed attribution…
Figure 33: Attribution flow graph. Directed edges show information flow from the DAS mediator (L⋆=1, green) through intermediate hubs to boundary heads (gold). Gold arrows: flow to boundary heads. Coral arrows: relay through L24H2 hub. Node size proportional to AP score
Figure 34: Cross-probe universality. (A) Measured angle vs. Haar null for seven probe architectures; all fall within the ±2σ null band (gray). (B) Subspace overlap (Σ cos² θ_i / min(k_1, k_2)) between each probe and DAS; all are ≤2.2%. (C) Probe accuracy/R² vs. angle to DAS: high performance does not predict causal alignment. target day d⋆ = ((d+179) mod 365)+1 (∼180 days away) and construct: x_adv = x + α U_M^⊤ U_M …
Figure 35: Mutual information. (A) MI bar chart: I(probe; DAS) and I(DAS; DOY) are at the null floor; only I(probe; DOY) is non-trivial. (B) Phase-shuffle null distribution with observed MI (red line) firmly within it (p=1.0). (C) Scatter of probe DOY vs. DAS energy: no structure. Extended specificity battery (Exp. 113). Using the ∆NLL values from 100 Set-F duration prompts on a single T4 instance, we compute ρ for …
Figure 36: Specificity battery. (A) ρ per subspace; DAS is off the chart at 2650×. (B) Null ρ distribution with named subspaces marked. (C) ∆NLL per subspace with random ±2σ band. Mock deception probe (Exp. 110). We train a logistic-regression “confidence monitor” on L⋆=1 activations to predict whether the model is confident (NLL below median) or uncertain (NLL above median) about each duration prompt. The monitor …
Figure 37: Mock deception probe. (A) NLL distribution split at median into confident/uncertain labels. (B) ρ comparison: DAS (2650×) vs. temporal probe (−6.5×) vs. safety monitor (≤2.0×). (C) Principal angle between the safety monitor and DAS sits at the Haar null. …
Figure 38: Ablation invisibility. (A) Probe-shift histogram for 50 random ablations (gray) vs. DAS ablation (red). (B) Per-DOY probe shift under DAS ablation; most are below the 3-day threshold. (C) Energy decomposition: DAS holds 4.6% of activation norm but 100% of causal effect. S54 Supplement: notation and abbreviations. Symbol / Definition: d, residual-stream dimension (2304 for Gemma 2 2B; 3584 for 9B); k, rank of pro…
Original abstract

A linear probe can decode a representation almost perfectly and yet be completely irrelevant to how the model uses it. On calendar-date duration reasoning in language models, a $\sin$/$\cos$ probe recovers day-of-year from a layer's activations, yet ablating its direction has no effect on the model's answers -- while ablating a four-dimensional subspace found by Distributed Alignment Search (DAS) at the same layer collapses performance entirely. We measure the angle between these two subspaces -- the \emph{readout-mediator angle} -- and find it indistinguishable from the angle between two random subspaces (the Haar-uniform null), meaning the probe has learned a direction orthogonal to the model's actual computation. Reverse-engineering the circuit reveals why: attention heads route month-grained context through learned QK offsets at ${\pm}30$ and ${\pm}61$ days, and MLPs then convert \emph{when} (absolute date) into \emph{how long} (duration) -- all downstream of the causal subspace the probe never touches. Sparse-autoencoder decomposition confirms the split: probe-aligned and DAS-aligned features encode semantically disjoint concepts with negligible causal overlap. The dissociation replicates across four scales ($1.5$-$9\,$B) and two model families, with preliminary evidence on two further domains (spatial displacement, symbolic arithmetic), suggesting that readout-mediator orthogonality is a general failure mode of probe-based interpretability. This directly undermines proposals to deploy probes as runtime safety monitors: the probe can report high confidence on a direction the model has silently abandoned.
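
For readers unfamiliar with the probe named in the abstract, here is a minimal sketch of a sin/cos day-of-year probe fitted by least squares on synthetic activations. The data generator, sizes, and function names are our assumptions for illustration only; the paper's training setup is not described on this page.

```python
import numpy as np

def sincos_targets(day, period=365.0):
    """Map day-of-year to a point on the unit circle."""
    phase = 2 * np.pi * np.asarray(day) / period
    return np.column_stack([np.sin(phase), np.cos(phase)])

def fit_probe(X, day, period=365.0):
    """Least-squares linear map (with bias) from activations to (sin, cos)."""
    Xb = np.column_stack([X, np.ones(len(X))])
    W, *_ = np.linalg.lstsq(Xb, sincos_targets(day, period), rcond=None)
    return W                                   # shape (d + 1, 2)

def decode_day(X, W, period=365.0):
    """Read the predicted (sin, cos) back into a day via atan2."""
    Xb = np.column_stack([X, np.ones(len(X))])
    s, c = (Xb @ W).T
    return (np.arctan2(s, c) % (2 * np.pi)) * period / (2 * np.pi)

# Synthetic activations (hypothetical): a circular date code embedded in 32 dims plus noise.
rng = np.random.default_rng(0)
n, d = 500, 32
day = rng.integers(1, 366, size=n)
X = sincos_targets(day) @ rng.standard_normal((2, d)) + 0.01 * rng.standard_normal((n, d))

W = fit_probe(X[:400], day[:400])
pred = decode_day(X[400:], W)
err = np.abs(pred - day[400:])
mean_err = np.minimum(err, 365.0 - err).mean()  # circular error in days
```

The first `d` rows of `W` span the rank-2 probe subspace. High decoding accuracy here says nothing about whether a model uses that subspace, which is the abstract's point.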

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that linear probes can achieve high decoding accuracy on representations while remaining orthogonal to a model's actual causal computation. On a calendar-date duration reasoning task, a sin/cos probe recovers day-of-year from layer activations, but ablating its direction leaves model performance intact; in contrast, ablating a 4D subspace identified via Distributed Alignment Search (DAS) at the same layer collapses performance. The angle between the probe direction and the DAS subspace is reported as indistinguishable from the angle between random subspaces under a Haar-uniform null on the Grassmannian. Circuit reverse-engineering (attention QK offsets at ±30/±61 days and downstream MLP conversion from absolute date to duration), SAE feature decomposition showing disjoint concepts, and replication across four model scales (1.5B–9B) and two families are presented as supporting evidence, with preliminary results on spatial displacement and symbolic arithmetic suggesting the readout-mediator orthogonality is a general failure mode of probe-based interpretability.

Significance. If the central geometric and causal dissociation holds, the work identifies a systematic limitation of probe-based interpretability: high probe accuracy need not imply relevance to the model's computation. The cross-scale replication across model families, the mechanistic circuit details, and the SAE confirmation of semantic disjointness provide concrete empirical grounding. This has direct implications for proposals to deploy probes as runtime safety monitors, as the probe may report on directions the model has abandoned.

major comments (2)
  1. [Abstract] Abstract (readout-mediator angle claim): the conclusion that the probe 'has learned a direction orthogonal to the model's actual computation' rests on the angle being 'indistinguishable from the angle between two random subspaces (the Haar-uniform null)'. The manuscript must specify the statistical procedure (sample size for the null distribution, test statistic, and p-value threshold) and justify why the Haar measure on the Grassmannian is the correct null for non-causal directions within the structured geometry of residual-stream activations.
  2. [DAS ablation results] DAS ablation and causal mediation (implied methods section): while ablating the 4D DAS subspace is stated to collapse performance 'entirely', the claim that this subspace is the true causal mediator (rather than a correlated but non-exhaustive direction) requires evidence that the subspace is minimal and exhaustive for the when-to-how-long conversion. Reporting whether performance remains at chance after the 4D ablation or whether additional dimensions yield further degradation would directly address whether orthogonality to this specific subspace establishes irrelevance to the full computation.
minor comments (2)
  1. [Results] The abstract mentions replication across scales and families but does not include a summary table of per-model angle measurements, ablation effect sizes, and statistical comparisons; adding such a table in the results section would improve readability and allow readers to assess consistency directly.
  2. [Methods] Notation for the readout-mediator angle and the precise definition of the 4D subspace (e.g., how the Grassmannian distance is computed) should be introduced with an equation in the methods to avoid ambiguity when comparing to the Haar null.
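
As an indication of what the equation requested in the second minor comment might look like, here is one standard way to write the principal angles and the Haar expectation. The symbols k_P, k_M, A, B, and m are our notation and may not match the paper's.

```latex
% Assumed notation: U_P \in \mathbb{R}^{d \times k_P} and U_M \in \mathbb{R}^{d \times k_M}
% are orthonormal bases of the probe and DAS subspaces.
\begin{align}
  U_P^{\top} U_M &= A \,\mathrm{diag}(\sigma_1, \dots, \sigma_m)\, B^{\top},
      \qquad m = \min(k_P, k_M), \\
  \theta_i &= \arccos \sigma_i,
      \qquad \bar{\theta} = \frac{1}{m} \sum_{i=1}^{m} \theta_i, \\
  \mathbb{E}_{\mathrm{Haar}}\!\left[ \sum_{i=1}^{m} \cos^2 \theta_i \right]
      &= \mathbb{E}\, \lVert U_P^{\top} U_M \rVert_F^2 = \frac{k_P \, k_M}{d}.
\end{align}
```

With k_P = k_M = k the expectation reduces to k²/d, which agrees with the analytic null named in the Figure 12 caption. For k = 4 and d = 2304 the corresponding reference angle arccos √(k/d) is about 87.6°.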

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address the two major comments point-by-point below. Both points can be addressed through clarifications and additional reporting that we will incorporate in the revision.

Point-by-point responses
  1. Referee: [Abstract] Abstract (readout-mediator angle claim): the conclusion that the probe 'has learned a direction orthogonal to the model's actual computation' rests on the angle being 'indistinguishable from the angle between two random subspaces (the Haar-uniform null)'. The manuscript must specify the statistical procedure (sample size for the null distribution, test statistic, and p-value threshold) and justify why the Haar measure on the Grassmannian is the correct null for non-causal directions within the structured geometry of residual-stream activations.

    Authors: We agree that the statistical procedure requires explicit description. In the revised manuscript we will add a dedicated paragraph (or appendix) stating the exact procedure used: the number of Monte Carlo samples drawn from the Haar measure, the precise definition of the subspace angle (principal angles between the 1D probe direction and the 4D DAS subspace), and the criterion for 'indistinguishability' (e.g., the observed angle lying inside the central 95% interval of the null distribution or a Kolmogorov-Smirnov test against the null). On the choice of null, the Haar measure is the unique rotationally invariant probability measure on the Grassmannian and therefore supplies the natural baseline for asking whether any preferred alignment exists; we will expand the text to note that, while residual-stream geometry is structured, the null still tests the specific hypothesis of alignment with the identified causal subspace rather than with the entire activation manifold. We will also report sensitivity checks under alternative sampling schemes if space allows. revision: yes

  2. Referee: [DAS ablation results] DAS ablation and causal mediation (implied methods section): while ablating the 4D DAS subspace is stated to collapse performance 'entirely', the claim that this subspace is the true causal mediator (rather than a correlated but non-exhaustive direction) requires evidence that the subspace is minimal and exhaustive for the when-to-how-long conversion. Reporting whether performance remains at chance after the 4D ablation or whether additional dimensions yield further degradation would directly address whether orthogonality to this specific subspace establishes irrelevance to the full computation.

    Authors: We will augment the results section with the requested controls. Specifically, we will report (i) the exact post-ablation accuracy after the 4D DAS intervention and confirm it is statistically indistinguishable from chance, and (ii) the effect of ablating the top-5D and top-6D DAS subspaces, showing that further dimensions produce no additional drop. These data, together with the existing circuit-level evidence that the 4D subspace captures the month-grained QK offsets and the downstream MLP conversion, support that the identified subspace is both minimal and exhaustive for the causal pathway under study. The revised text will make this explicit. revision: yes
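
The ablation controls discussed in this exchange can be summarized by the specificity ratio ρ = ∆_DAS/∆_random that appears in the Figure 2 caption. The sketch below computes that ratio for a toy linear model, using the mean squared change in output as a stand-in for the accuracy or NLL drop. The toy model, the sizes, and the choice of 25 random subspaces are our assumptions, not the paper's protocol.

```python
import numpy as np

def ablate(x, U):
    """Project span(U) out of each row of x. U has orthonormal columns."""
    return x - (x @ U) @ U.T

def random_frame(d, k, rng):
    """Orthonormal basis of a random k-dimensional subspace of R^d."""
    return np.linalg.qr(rng.standard_normal((d, k)))[0]

def specificity_ratio(drop_fn, U, n_random=25, seed=0):
    """rho = drop under ablating U / mean drop under same-rank random ablations."""
    rng = np.random.default_rng(seed)
    d, k = U.shape
    random_drops = [drop_fn(random_frame(d, k, rng)) for _ in range(n_random)]
    return drop_fn(U) / np.mean(random_drops)

# Toy setup (hypothetical): output weights live inside a rank-4 subspace U_M.
rng = np.random.default_rng(1)
d, k = 128, 4
U_M = random_frame(d, k, rng)
w = U_M @ rng.standard_normal(k)
x = rng.standard_normal((200, d))

def drop_fn(U):
    """Mean squared change in the toy output after ablating span(U)."""
    return float(np.mean((x @ w - ablate(x, U) @ w) ** 2))

rho = specificity_ratio(drop_fn, U_M)
```

In this toy case ρ is governed mainly by d/k, since a random rank-k subspace captures roughly a k/d fraction of the output weights. The same function could be called with larger-rank subspaces to check whether extra dimensions add any further drop.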

Circularity Check

0 steps flagged

No significant circularity; empirical ablations and external null model are independent of fitted inputs

full rationale

The paper's central claims rest on empirical ablation experiments (probe direction vs. DAS 4D subspace) and a geometric comparison of the readout-mediator angle against the Haar-uniform null on the Grassmannian. These quantities are not derived from equations that reduce to the same data used to identify the subspaces; the null distribution is an external mathematical reference, and performance drops are measured outcomes rather than fitted predictions. No self-definitional steps, fitted-input-as-prediction patterns, or load-bearing self-citations appear in the derivation chain. The work is self-contained against external benchmarks.
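
The independence of the null from the fitted subspaces can be made concrete: the Haar null is simulated from Gaussian draws alone, with no model data involved. The following is a minimal sketch of such a Monte Carlo test, assuming the statistic is the sum of squared cosines of the principal angles and the criterion is a central 95% interval. Both choices, and all names and sizes, are ours.

```python
import numpy as np

def haar_frame(d, k, rng):
    """Orthonormal basis of a Haar-random k-dim subspace (QR of a Gaussian matrix)."""
    Q, _ = np.linalg.qr(rng.standard_normal((d, k)))
    return Q

def overlap(U, V):
    """Sum of squared cosines of the principal angles, i.e. ||U^T V||_F^2."""
    return float(np.sum((U.T @ V) ** 2))

def haar_null_test(U_obs, U_ref, n_draws=2000, seed=0):
    """Compare the observed overlap with a Monte Carlo Haar null of the same rank."""
    rng = np.random.default_rng(seed)
    d, k = U_obs.shape
    null = np.array([overlap(haar_frame(d, k, rng), U_ref) for _ in range(n_draws)])
    lo, hi = np.percentile(null, [2.5, 97.5])
    obs = overlap(U_obs, U_ref)
    return obs, (lo, hi), bool(lo <= obs <= hi)

# Sanity check with a deliberately aligned pair (hypothetical sizes).
d, k = 256, 4
U_ref = haar_frame(d, k, np.random.default_rng(42))
obs_aligned, (lo, hi), inside_aligned = haar_null_test(U_ref, U_ref)
```

A perfectly aligned pair has overlap k and falls far outside the interval, while the interval itself brackets the analytic mean k²/d. An observed probe-versus-DAS overlap inside the interval would be what the paper calls indistinguishable from the null.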

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The work is empirical and applies existing tools (linear probes, DAS, SAEs) to a new task; no free parameters, mathematical axioms, or new postulated entities are introduced or required by the central claim in the abstract.

pith-pipeline@v0.9.1-grok · 5822 in / 1513 out tokens · 40306 ms · 2026-06-29T13:28:41.115277+00:00 · methodology


Reference graph

Works this paper leans on

29 extracted references · 10 canonical work pages · 4 internal anchors

  1. [1]

    Gemma 2: Improving Open Language Models at a Practical Size

    Gemma Team. Gemma 2: Improving open language models at a practical size. arXiv preprint arXiv:2408.00118,

  2. [2]

    When models manipulate manifolds: The geometry of a counting task

    W. Gurnee, E. Ameisen, I. Kauvar, J. Tarng, A. Pearce, C. Olah, and J. Batson. When models manipulate manifolds: The geometry of a counting task. arXiv preprint arXiv:2601.04480,

  3. [3]

    Designing and Interpreting Probes with Control Tasks

    J. Hewitt and P. Liang. Designing and interpreting probes with control tasks. In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing (EMNLP), pages 2733–2743,

  4. [4]

    Language models use trigonometry to do addition, 2025

    S. Kantamneni and M. Tegmark. Language models use trigonometry to do addition. arXiv preprint arXiv:2502.00873,

  5. [5]

    Gemma Scope: Open Sparse Autoencoders Everywhere All At Once on Gemma 2

    T. Lieberum, S. Rajamanoharan, et al. Gemma scope: Open sparse autoencoders everywhere all at once on Gemma 2. arXiv preprint arXiv:2408.05147,

  6. [6]

    E. S. Lubana, C. Rager, S. S. R. Hindupur, V. Costa, G. Tuckute, O. Patel, S. K. Murthy, T. Fel, D. Wurgaft, E. J. Bigelow, J. Lin, D. Ba, M. Wattenberg, F. Viegas, M. Weber, and A. Mueller. Priors in time: Missing inductive biases for language model interpretability. In The Fourteenth International Conference on Learning Representat...

  7. [7]

    The origins of representation manifolds in large language models

    A. Modell, P. Rubin-Delanchy, and N. Whiteley. The origins of representation manifolds in large language models. arXiv preprint arXiv:2505.18235.

  8. [8]

    A. Nam, H. Conklin, Y. Yang, T. Griffiths, J. Cohen, and S.-J. Leslie. Causal head gating: A framework for interpreting roles of attention heads in transformers. In Advances in Neural Information Processing Systems (NeurIPS).

  9. [9]

    A. S. Okatan, M. İ. Akbaş, L. N. Kandel, and B. Peköz. Seed-induced uniqueness in transformer models: Subspace alignment governs subliminal transfer. arXiv preprint arXiv:2511.01023.

  10. [10]

    A. N. Tak, A. Banayeeanzade, A. Bolourani, M. Kian, R. Jia, and J. Gratch. Mechanistic interpretability of emotion inference in large language models. In Findings of the Association for Computational Linguistics: ACL 2025.

  11. [11]

    J. Wang, X. Ge, W. Shu, Z. He, and X. Qiu. Dimensional collapse in transformer attention outputs: A challenge for sparse dictionary learning. arXiv preprint arXiv:2508.16929.

  12. [12]

    Qwen2.5 Technical Report

    A. Yang et al. Qwen2.5 technical report. arXiv preprint arXiv:2412.15115.

  13. [13]

    A. Zou, L. Phan, S. Chen, J. Campbell, P. Guo, R. Ren, A. Pan, X. Yin, M. Mazeika, A.-K. Dombrowski, et al. Representation engineering: A top-down approach to AI transparency. arXiv preprint arXiv:2310.01405.

  14. [14]

    Our setting differs in two ways that make standard DAS the right choice

    S1 Supplement: extended DAS results and implementation. Why standard DAS and not HyperDAS. HyperDAS [Sun et al., 2025] trains a separate hypernetwork to automate the search over token positions, which is valuable when the feature location is unknown. Our setting differs in two ways that make standard DAS the right choice. First, L⋆ is already fixed by ...

  15. [15]

    The data bear this out asymmetrically: U_G is measurably closer to U_M than to noise, but not close to recovering U_M (θ̄ is 2.3° below null, not 0°)

    That proposition claims the mediator is shaped by ∇_x f (first moment) and the probe by covariance with the target (second moment). The data bear this out asymmetrically: U_G is measurably closer to U_M than to noise, but not close to recovering U_M (θ̄ is 2.3° below null, not 0°). The effective rank of 76 explains why: ∇_x f at each prompt is a different vector...

  16. [16]

    QK-twist magnitude by layer. The maximal |z| per layer traces a unimodal curve: 1.8 at L=0, rising to 7.3 at L=5, falling to <2 by L=12 on Gemma. The sign of the detected offset alternates across depth (early heads carry positive c, middle layers both signs, late heads negative), compatible with a forward-then-backward temporal lookup. S5 Supplement: attribution-...

  17. [17]

    lift (p < 10⁻²)

    [Figure 7 plot residue. Panel B: spurious → task-tuned. Panel C: circularness (FFT) against training step (10⁰ to 10⁵), annotated "37× growth from init" and "geometry emerges in window". Plot title: Emergence in Pythia 1.4B: probe R² saturates at init · boundary heads consolidate · geometry emerges only inside the window.] Figure 7: Emergence in Pythia 1.4B. Three diagnostics on a shared ...

  18. [18]

    argued manifolds reflect translational symmetries in pretraining data. We build directly on this line while adding the causal (DAS / ablation), cross-family (universality population-vs-coordinate), training-dynamical (Pythia emergence), and deployment (clinical-δ(x)) layers. Gurnee et al. (2025): manifold manipulation. The closest theoretical antecedent. T...

  19. [19]

    dark matter

    introduce MP-SAE, a sparse autoencoder whose encoder unrolls matching pursuit into residual-guided steps, and formalize conditional orthogonality: orthogonality across hierarchy levels but not within. The readout-mediator dissociation reported here is an instance of this structure: the readout subspace (probe, 2-D) and mediator subspace (DAS, 4-D) occupy d...

  20. [20]

    AUPRC skill = (AUPRC − p)/(1 − p) corrects for class imbalance; failure rates (p) are 90% (A), 97% (B), 96% (C).

    Tier   | n   | acc ±20% | Pearson r | AUROC | AUPRC | AUPRC skill
    A      | 475 | 0.10     | +0.69     | 0.58  | 0.94  | 0.40
    B      | 133 | 0.03     | +0.06     | 0.59  | 0.98  | 0.27
    C      | 371 | 0.04     | +0.31     | 0.62  | 0.98  | 0.46
    Pooled | 979 | –        | +0.34     | 0.63  | 0.97  | 0.70

    Per-tier × per-duration-bin Pearson r (bins: ≤7d, 8–30d, 31–365d, >365d): Tier A: −0.06 (n=100), −0.16 (n=125), +...

  21. [21]

    sum of a and b

    ... top-1 agree (DAS) | top-1 agree (rand)
    Arithmetic | 20 | 0.012 ± 0.006 | [0.000, 0.001] (5–95) | 1.00 | 1.00
    Trivia | 20 | 0.014 ± 0.012 | [0.000, 0.001] (5–95) | 1.00 | 1.00
    Natural | 20 | 0.015 ± 0.016 | [0.000, 0.002] (5–95) | 0.85 | 1.00
    Date task (cf. §4) | 332 | – (acc drop 42 → 0%) | ratio > 10³ | 0.0 | ≈1.0

    Read honestly: the DAS basis does induce measurable distributional shift beyond a random k=4 ablation on non-date...

  22. [22]

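    $p(\sigma) \propto \prod_{i<j} (\sigma_i^2 - \sigma_j^2)^2 \prod_{i=1}^{k} \sigma_i^{0} (1 - \sigma_i^2)^{(d-2k-1)/2}$, which is the Jacobi ensemble $J(k, k, d-k)$ on $[0,1]^k$. Step 2 (mean). By standard trace-moment calculus (Collins & Matsumoto, 2009), for any $k$ and $d \ge 2k$, $\mathbb{E}[\operatorname{tr}(U V^\top V U^\top)] = \mathbb{E}\sum_i \sigma_i^2 = k \cdot k / d$, giving $\mathbb{E}[\sigma_i^2] = k/d$ by symmetry of the $\sigma_i$ under the ensemble. Jensen then gives $\mathbb{E}[\sigma_i] \le \sqrt{k/d}$ with equality ...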

  23. [23]

    Figure 13: Probe–DAS mean principal angle across all 26 layers of Gemma 2 2B at k ∈ {2, 4, 6}

    Layer (Δ from L⋆) | CV probe R² (k=2) | probe–DAS θ̄ (k=2) | θ̄ (k=4) | θ̄ (k=6)
    L=0 (Δ=−1)       | 0.991             | 88.95°            | 88.83°   | 87.34°
    L=1 (L⋆)         | 0.993             | 89.01°            | 88.42°   | 87.59°
    L=2 (Δ=+1)       | 0.993             | 88.44°            | 88.40°   | 87.70°
    L=3 (Δ=+2)       | 0.995             | 89.47°            | 88.47°   | 87.42°
    L=22 (deep)      | –                 | 89.0°             | 89.0°    | 89.0°
    L=25 (last)      | –                 | 89.8°             | 89.8°    | 89.8°

    [Figure 13 plot residue. x-axis: layer l (0 to 25); y-axis: ⟨cos² θ_i⟩.] (a) Probe alignme...

  24. [24]

    In the Prop

    discovers non-basis-aligned interpretable subspaces by unsupervised feature reconstruction and validates them via causal patching. In the Prop. 3 framework, NDM subspaces should sit between probes and DAS on the readout-to-mediator spectrum: they are not task-gradient targeted (so ρ_k should be smaller than DAS), but they are geometry-respecting (so ρ_k shou...

  25. [25]

    The sum of a and b is

    report that attention outputs are low-rank across families and scales. Our Supp. S28 observation that effective mediator rank saturates around ∼6 at d ∈ {1536, 2304, 3584} is consistent with this, and our Prop. 3 consequence (specificity grows as d/k at fixed k) makes the >500,000× ratio at 7B/9B a direct prediction of attention low-rankness. S33 Supplement: op...

  26. [26]

    March 5 to June 10

    decompose per-token activations into a predictable component (the projection of x_t onto the subspace spanned by {x_1, …, x_{t−1}}) and a novel component (the orthogonal residual). We test whether this decomposition explains the 88° readout-mediator angle, specifically whether the mediator sits in the predictable or novel part of the activation. We evaluate t...

  27. [27]

    Figure 18: Temporal dynamics of mediator energy across token positions

    [Figure 18 plot residue. x-axis: token position (0 to 20); color scale: DAS-projected cosine similarity (−1.00 to 1.00); example prompt label: "Event B happened o…".] Figure 18: Temporal dynamics of mediator energy (e_med, gold bars) across token positions for three Set-F duration prompts. The k/d random baseline is shown as a dashed line. Left: per-token mediator energy with token la...

  28. [28]

    what is the MLP computing?

    shows a monotone decay from the L=1 peak (26.4× Haar null) through a mid-network trough at L=18 (3.1×), followed by a secondary recovery at L=22 (6.6×). Critically, DAS energy exceeds the Haar random null at every layer (minimum 2.1× at L=25), indicating that mediator information is carried forward through the residual stream as an additive component throu...

  29. [29]

    confidence monitor

    [Figure 33 legend: boundary head (QK-twist); relay hub (L24H2); processing head (size ∝ AP); flow to boundary head; relay through hub; background flow.] Figure 33: Attribution flow graph. Directed edges show information flow from the DAS mediator (L⋆=1, green) through intermediate hubs to boundary heads (gold). Gold arrows: flow to boundary heads. Coral arrows: relay through L24H2 ...
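
The Haar-null mean quoted in entry 22 (the expected sum of squared principal-angle cosines between two random k-dimensional subspaces of a d-dimensional space is k·k/d) can be checked numerically. The following is a minimal sketch, not the paper's code: the function names, the trial count, and the choice d=64, k=4 are illustrative assumptions, and random subspaces are drawn by Gram-Schmidt on Gaussian vectors using only the Python standard library.

```python
import math
import random


def random_orthonormal_basis(d, k, rng):
    """Return k orthonormal vectors in R^d (Gram-Schmidt on Gaussian draws).

    The span of i.i.d. Gaussian vectors is uniformly (Haar) distributed
    over k-dimensional subspaces, which is the null model in question.
    """
    basis = []
    while len(basis) < k:
        v = [rng.gauss(0.0, 1.0) for _ in range(d)]
        for b in basis:
            dot = sum(x * y for x, y in zip(v, b))
            v = [x - dot * y for x, y in zip(v, b)]
        norm = math.sqrt(sum(x * x for x in v))
        if norm > 1e-12:
            basis.append([x / norm for x in v])
    return basis


def sum_squared_cosines(U, V):
    """Squared Frobenius norm of U V^T, i.e. the sum of sigma_i^2."""
    return sum(sum(x * y for x, y in zip(u, v)) ** 2 for u in U for v in V)


def haar_null_mean(d, k, trials=500, seed=0):
    """Monte Carlo estimate of E[sum_i sigma_i^2]; theory says k*k/d."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(trials):
        U = random_orthonormal_basis(d, k, rng)
        V = random_orthonormal_basis(d, k, rng)
        total += sum_squared_cosines(U, V)
    return total / trials


def rms_cosine_angle_deg(d, k):
    """Angle whose cosine is sqrt(k/d), the Jensen bound on E[sigma_i]."""
    return math.degrees(math.acos(math.sqrt(k / d)))
```

Calling `haar_null_mean(64, 4)` should land close to 4·4/64 = 0.25, and `rms_cosine_angle_deg(d, k)` gives a reference angle for a given width; which d and k match a particular model and subspace is left to the reader, since the sketch assumes nothing about the paper's exact setup beyond the formula in entry 22.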