arxiv: 2605.00333 · v2 · pith:WVCPJ7DFnew · submitted 2026-05-01 · 💻 cs.LG · cs.CL

Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B

Pith reviewed 2026-05-21 00:20 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords attention heads · pretrained language models · ablation analysis · cross-distribution transfer · frozen model evaluation · token pattern tasks

The pith

A frozen, text-only Gemma model contains specific attention heads that rank as highly important for both language probes and non-language pattern tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates how a pretrained language model, never exposed to non-text data during training, can still support performance on token-pattern tasks after only a thin trainable interface is added. It identifies four heads in layers 26 and 27 that consistently appear among the most important on an English TxtCopy probe and on ablation tests for binary copy, associative recall, Rule 90, and binary addition. The joint appearance of these heads across the two signals is unlikely under a hypergeometric null and holds after permutation checks. Causal zeroing of one such head produces a larger performance drop on a held-out cube task than layer-matched controls, indicating specificity.

Core claim

Within the L24-L29 slice of 192 heads, the four heads L26.28, L27.28, L27.2 and L27.3 rank as top-tier on both the TxtCopy attention probe and per-head ablation impact for the four non-language tasks; their slice-level coincidence reaches P = 0.0013 under the hypergeometric null and survives multiplicity-aware permutation testing at P_V4 = 0.013. Head-level causal ablation of L26.28 drops success on the cube-double-play task from 63.3% to 10.0%, a drop 3.2 times larger than the one produced by ablating a layer-matched low-TxtCopy negative control (which leaves success at 46.7%).
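The hypergeometric figure can be checked by hand from the parameters the abstract reports (N = 192 heads in the slice, K = 38, n = 4). The sketch below assumes one particular reading, which is not spelled out on this page: K = 38 is the size of the top-tier set on one signal, n = 4 is the number of named heads, and the event of interest is all four landing inside that set. The function name `hypergeom_tail` is illustrative only.

```python
from math import comb

def hypergeom_tail(N: int, K: int, n: int, k: int) -> float:
    """P(X >= k) when n heads are drawn without replacement from N heads,
    K of which belong to the top-tier set."""
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / comb(N, n)

# N, K, n are the values reported in the abstract.
# Assumption: the test asks for all 4 named heads to fall in the top-38 set.
p = hypergeom_tail(N=192, K=38, n=4, k=4)
print(f"P(all 4 heads in the top-38 set) = {p:.4f}")  # prints 0.0013
```

Under this reading the probability reduces to C(38, 4) / C(192, 4), which rounds to the reported 0.0013.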

What carries the argument

Cross-distribution head-importance fingerprint formed by joint top-tier ranking of heads on a text attention probe and on ablation impact across non-language token tasks.
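A minimal sketch of what "joint top-tier ranking" could look like in practice, assuming each head has one probe score and one ablation-impact score. How the paper aggregates impact across its four tasks is not stated on this page, so the single impact vector, the function name `joint_top_tier`, and the toy numbers are all hypothetical.

```python
import numpy as np

def joint_top_tier(probe_scores, ablation_impact, k):
    """Return the indices of heads ranked in the top k on BOTH signals."""
    probe = np.asarray(probe_scores, dtype=float)
    impact = np.asarray(ablation_impact, dtype=float)
    top_probe = set(np.argsort(-probe)[:k].tolist())    # k highest probe scores
    top_impact = set(np.argsort(-impact)[:k].tolist())  # k highest ablation impacts
    return sorted(top_probe & top_impact)

# Hypothetical toy data for 8 heads; these are not values from the paper.
probe = [0.9, 0.1, 0.8, 0.2, 0.3, 0.7, 0.05, 0.4]
impact = [0.6, 0.05, 0.9, 0.8, 0.1, 0.2, 0.0, 0.3]
joint = joint_top_tier(probe, impact, k=3)
print(joint)  # prints [0, 2]
```

Heads 0 and 2 are the only ones in the top three on both lists, so they form the toy "fingerprint".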

If this is right

  • The frozen pretrained weights already contain structure that supports non-text pattern tasks once a thin interface is trained.
  • Pretrained Gemma reaches 60 % on the cube task while random-initialized controls remain near 1 %.
  • Zeroing L26.28 produces a larger performance drop than zeroing a layer-matched low-TxtCopy head, supplying head-level causal evidence.
  • Some tasks such as Walker2d recruit heads outside the L24-L29 slice and show weaker ablation specificity.
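Zeroing a head, as in the bullets above, can be illustrated on a toy multi-head attention output. This is a sketch under stated assumptions, not the paper's code: per-head outputs are concatenated and multiplied by an output projection `W_O` laid out as (n_heads * d_head, d_model), and the shapes and random values are hypothetical.

```python
import numpy as np

def attention_output(head_outputs, W_O, ablate_head=None):
    """Combine per-head outputs through the output projection.

    head_outputs: array of shape (n_heads, seq_len, d_head)
    W_O:          array of shape (n_heads * d_head, d_model)
    ablate_head:  index of a head whose output is zeroed, or None
    """
    h = np.array(head_outputs, dtype=float)  # np.array copies, so the input is untouched
    if ablate_head is not None:
        h[ablate_head] = 0.0                 # zero-ablation of a single head
    n_heads, seq_len, d_head = h.shape
    concat = h.transpose(1, 0, 2).reshape(seq_len, n_heads * d_head)
    return concat @ W_O

# Hypothetical toy sizes: 4 heads, 5 positions, d_head = 8, d_model = 16.
rng = np.random.default_rng(0)
heads = rng.normal(size=(4, 5, 8))
W_O = rng.normal(size=(32, 16))

full = attention_output(heads, W_O)
ablated = attention_output(heads, W_O, ablate_head=2)

# The difference is exactly head 2's contribution through its slice of W_O.
contribution = heads[2] @ W_O[2 * 8:3 * 8]
print(np.allclose(full - ablated, contribution))  # prints True
```

Because the output projection is linear, removing one head subtracts exactly that head's slice of the projection, which is why a per-head ablation isolates a single head's contribution at that layer.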

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same heads may implement reusable pattern-matching operations that pretraining discovers even when the training distribution is purely linguistic.
  • Systematic head-ablation mapping could serve as a lightweight diagnostic for which pretrained components transfer to new modalities without full fine-tuning.
  • Extending the probe set to additional sequence or grid tasks would test whether the four-head coincidence is stable or task-dependent.

Load-bearing premise

The TxtCopy probe and the four chosen non-language tasks serve as representative proxies for head importance that generalizes across distributions.

What would settle it

Re-running the joint-ranking analysis with a different text probe or a new set of non-language tasks that fails to recover the same four heads at comparable significance levels would falsify the claimed cross-distribution fingerprint.

Figures

Figures reproduced from arXiv: 2605.00333 by Abay Bektursun.

Figure 13: Cube-task1 substrate isolation training curves.
Figure 1: FrozenRandom-GPT2 architecture-alone control.
Figure 14: OGBench cube-double-play-task1 layer ablation.
Figure 2: Cube-task1 substrate isolation, $n = 3$ training curves. Pretrained Gemma L26 (purple, mean ± std) climbs to ~65% over 100K iters; pretrained L24 (red) reaches ~45%; NC1 random Gemma (gray) stays flat near zero throughout. The +59pt L26-vs-NC1 gap is the substrate-isolation result, not the absolute level. GCIQL=74% (green dashed) for absolute-performance context. L26 has bimodal seed behavior (s42=96% peak, …
Figure 6: Walker2d $n = 3$ result. Gemma-DT D4RL normalized score: per-seed best-checkpoint trajectories (s42, s1337, s2024). All three seeds reach DT-parity; peak iterations are dispersed across the training budget. Reference lines: BC (63.9), DT 1.2M (74.0), IQL (78.3). Trainable parameters: 521K (vs DT 1.2M). Frozen-substrate compression. A layer-drop sweep at $n = 3$ shows that dropping L24 (5L slice L25–L29, 2.45B …
Figure 3: Computational exaptation under dual measurement.
Figure 3: FrozenRandom-GPT2 architecture-alone control.
Figure 4: Head-level causal validation on OGBench cube-task1.
Figure 1: Computational exaptation. Each point is one (layer, head) × (task) critical ablation. X-axis: head’s TxtCopy probe score on 95 English sentences, as a ratio to the L24–L29 slice mean. Y-axis: head’s task-ablation impact Δ (per-bit error increase when the head is zeroed) on one of four non-language tasks. Stars mark the four named heads. L26.28 scores 3.7× slice baseline on English text (4th of 192 heads) a…
Figure 5: Dyck-2 plateau on frozen Gemma; cracked by matched-capacity trained transformer.
Figure 4: Dyck-2 plateau on frozen Gemma; cracked by matched-capacity trained transformer.
Figure 12: Gemma 4 31B occupies a Pareto frontier in performance vs scale.
Figure 7: Sparse, task-specific criticality. Different tasks select different dominant heads on the same physical layer L27. C.2 Single-function classification. Of 141 critical pairs at threshold, 67 (47.5%) classify as single-function above 1.5× slice baseline. TxtCopy (24 pairs), Induction (22 pairs), PrevToken (15 pairs) account for 90% of single-function classifications. Consistent with the prediction of superpo…
Figure 2: Length-OOD profile on CA Rule 90. Left: best-checkpoint per-bit error vs. sequence length for seven controls, trained on lengths 3–20 inclusive. Green band marks the training-edge length L20. Trained-Transformer is a matched-capacity from-scratch trained transformer (6.36M, $1/\sqrt{d_k}$ scaling, $n = 2$ seeds, run at its best lr=1e-4); other controls at canonical lr=3e-4. Right: per-bit error ratio of each altern…
Figure 8: Distillation circuit-prediction. AR per-bit error vs length. DS-PROC multi-hint at s42 closes to within 24% of online Gemma at zero Gemma inference cost. E.3 Other tasks. Copy (localized at L27.28): DS-PROC single-hint L30 = 0.030 (1.2× over Student-only); multi-hint does not improve. CA R90: DS-PROC multi-hint L30 = 0.138 vs Student-only 0.171 (−19%). Addition: DS-PROC L30 = 0.067 vs Student-only 0.085 (−…
Figure 7: Sparse, task-specific criticality. Head-ablation impact Δ per task, normalized by task-maximum. Each panel is 6 layers × 32 heads = 192 heads. Different tasks select different dominant heads on the same physical layer L27: L27.28 for copy and addition, L27.2 for AR, L27.3 for CA R90. C.2 47% of critical pairs have a single dominant language function. Of 141 critical pairs at threshold, 67 (47.5%) classify…
Figure 9: Walker2d $n = 3$ result. Gemma-DT D4RL normalized score: per-seed best-checkpoint trajectories. All three seeds reach DT-parity. Reference lines: BC (63.9), DT 1.2M (74.0), IQL (78.3). Trainable: 521K. F.2 OGBench scene-play-task1: SOTA win at $n = 3$. GemmaIQL on a single layer L24 (488M frozen). $n = 3$ seeds, last-3-mean = 97.33% ± 0.74 vs published GCIQL = 93%, Δ = +4.33pt. All three seeds above 96%. NC1 ra…
Figure 5: Distillation circuit-prediction. AR per-bit error vs. length. DS-PROC multi-hint at s42 closes to within 24% of online Gemma (1.24× ratio at L30) at zero Gemma inference cost. E.3 Other tasks: circuit-layout prediction extends. Copy (localized at L27.28): DS-PROC single-hint L30 = 0.030 (1.2× over Student-only); multi-hint does not improve. CA R90: DS-PROC multi-hint L30 = 0.138 vs Student-only 0.171 (−19…
Figure 10: L26 cube-task1 per-seed curves. F.4 Multi-attribute matched controls and within-layer regression on L26 (M2). To test whether L26.28’s −53.3pt drop is explained by simpler properties, we computed three attribute statistics on cube-task1 observations ($n = 256$ states): (a) output-projection weight norm $\|W_O[:, h \cdot d_h : (h+1) d_h]\|_F$, (b) per-head activation norm $\|\text{attn\_out}_h\|_2$, (c) per-head activation CV. We m…
Original abstract

Frozen Gemma 4 31B weights pretrained exclusively on text, unmodified, transfer through a thin trainable interface to non-text modalities the substrate has never processed. On the L24--L29 slice (192 attention heads), an English-text TxtCopy attention probe (95 sentences) and per-head ablation impact on four non-language token-pattern tasks (binary copy, associative recall, 1D cellular automaton Rule 90, binary addition) jointly classify four heads -- L26.28, L27.28, L27.2, L27.3 -- as top-tier on both signals. The slice-level joint coincidence is significant under hypergeometric null ($P = 0.0013$, $N=192$, $K=38$, $n=4$) and survives multiplicity-aware permutation tests ($P_{V4} = 0.013$). Pretrained Gemma L26 reaches 60.22% on OGBench cube-double-play-task1 vs ~1% for random-init Gemma ($+59$pt at $n=3$); a FrozenRandom-GPT2 control with correct $1/\sqrt{d_k}$ scaling also fails. Head-level causal validation: zeroing L26.28 in the trained cube-task1 IQL agent drops success $63.3\% \to 10.0\%$ vs $46.7\%$ for a layer-matched low-TxtCopy negative control ($3.2\times$ specificity at $n=30$; $n=5$ paired-$t$ $p=0.039$). A full L26 sweep places L26.28 at rank 4 of 32. Honest negatives: within-L26 Spearman $\rho(\text{TxtCopy, drop}) = +0.37$ (opposite of within-layer causal reading); single-head activation patching does not transfer the matching variable; the 4 named heads alone do not suffice on any task; Walker2d-DT and scene-task1 recruit L24 outside the named slice and show null head-ablation specificity. We frame the contribution as a cross-distribution importance fingerprint at the slice level plus head-level causal evidence on one cross-modality target.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The manuscript claims that four attention heads (L26.28, L27.28, L27.2, L27.3) in the L24--L29 slice of frozen pretrained Gemma 4 31B are jointly top-tier on an English TxtCopy probe and four non-language tasks (binary copy, associative recall, Rule 90, binary addition). The slice-level overlap is statistically significant under a hypergeometric null (P=0.0013, N=192, K=38, n=4) and under multiplicity-aware permutation tests (P_V4=0.013). Causal ablation of L26.28 on a cube task shows 3.2x specificity: success falls from 63.3% to 10.0%, versus 46.7% after ablating a layer-matched low-TxtCopy control (paired-t p=0.039 at n=5). The pretrained model also outperforms random-init and FrozenRandom-GPT2 controls. The work reports honest negatives on other tasks and frames the result as a cross-distribution head-importance fingerprint.
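The 3.2x specificity figure follows from the three success rates quoted in the summary. The short check below assumes both ablations are measured against the same 63.3% unablated baseline, which is how the abstract presents them; the variable names are illustrative.

```python
# Success rates (percent) as reported in the abstract.
baseline = 63.3        # trained cube-task1 agent, no ablation
after_target = 10.0    # after zeroing L26.28
after_control = 46.7   # after zeroing the layer-matched low-TxtCopy control

# Assumption: both drops are taken relative to the same baseline.
target_drop = baseline - after_target    # drop in percentage points
control_drop = baseline - after_control  # drop in percentage points
specificity = target_drop / control_drop

print(f"{target_drop:.1f} pt vs {control_drop:.1f} pt -> {specificity:.1f}x")
# prints: 53.3 pt vs 16.6 pt -> 3.2x
```

The 53.3-point drop matches the "−53.3pt" figure quoted in the caption of Figure 10.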

Significance. If the non-language tasks probe computational demands distinct from text, the result would identify reusable attention heads in text-pretrained transformers that support transfer to other modalities without weight modification. Strengths include the use of hypergeometric and permutation tests, head-level causal ablation with controls, outperformance over random baselines, and explicit reporting of negative results on other tasks and within-layer correlations. This could inform modular transfer learning and the search for general computational primitives in large models.

major comments (2)
  1. [Abstract] Abstract: The cross-distribution fingerprint claim rests on the four non-language tasks (binary copy, associative recall, Rule 90, binary addition) probing demands distinct from the TxtCopy text probe. These tasks are all discrete sequential token-manipulation problems that structurally resemble sentence copying, so the observed head overlap and ablation effects (e.g., L26.28) may reflect shared sequential attention mechanisms rather than borrowed geometry across truly different distributions. Explicit justification or additional tasks from continuous or non-sequential modalities is required to support the central claim.
  2. [Abstract] Abstract: The four heads are identified post-hoc as the joint top performers on the TxtCopy probe and the selected non-language tasks. While the hypergeometric test reports P=0.0013, the data-dependent choice of both the heads and the task set may require a pre-specified analysis plan or adjusted multiplicity correction to confirm that the significance is not inflated by selection.
minor comments (1)
  1. [Abstract] Abstract: The layer-head notation (L26.28 etc.) should include a brief definition or pointer to the model architecture section to aid readers unfamiliar with Gemma's indexing convention.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. We address each major comment below, providing the strongest honest defense of our claims while indicating revisions where the manuscript can be strengthened without misrepresentation.

  1. Referee: [Abstract] Abstract: The cross-distribution fingerprint claim rests on the four non-language tasks (binary copy, associative recall, Rule 90, binary addition) probing demands distinct from the TxtCopy text probe. These tasks are all discrete sequential token-manipulation problems that structurally resemble sentence copying, so the observed head overlap and ablation effects (e.g., L26.28) may reflect shared sequential attention mechanisms rather than borrowed geometry across truly different distributions. Explicit justification or additional tasks from continuous or non-sequential modalities is required to support the central claim.

    Authors: We agree that all tasks involve sequential token manipulation and thus share some structural features with sentence copying. However, the non-language tasks target distinct computational primitives not reducible to generic sequential attention: binary addition requires carry propagation and positional arithmetic absent from text copying; Rule 90 implements a specific local neighborhood transition rule from cellular automata theory; associative recall tests binding and retrieval without semantic or syntactic structure. These differences support interpreting the overlap as evidence of reusable heads for non-text distributions. We will revise the abstract and add a dedicated paragraph in the discussion to explicitly justify the task choices by contrasting their computational demands, while acknowledging the limitation that the current set remains discrete and sequential. We will also note planned extensions to continuous or non-sequential modalities in future work. revision: partial

  2. Referee: [Abstract] Abstract: The four heads are identified post-hoc as the joint top performers on the TxtCopy probe and the selected non-language tasks. While the hypergeometric test reports P=0.0013, the data-dependent choice of both the heads and the task set may require a pre-specified analysis plan or adjusted multiplicity correction to confirm that the significance is not inflated by selection.

    Authors: The four heads were selected based on joint top-tier performance, introducing a data-dependent element. The hypergeometric test evaluates overlap significance under a fixed null of random head selection within the slice, and we supplemented it with multiplicity-aware permutation tests (P_V4 = 0.013) that explicitly simulate the selection process over task sets and head rankings. The L24--L29 slice itself was chosen a priori from preliminary layer-wise importance scans. In revision we will expand the methods section to document the full analysis pipeline with upfront criteria for slice selection, task inclusion, and head ranking, and we will add a sensitivity analysis showing that the overlap remains significant under alternative task subsets. This addresses the multiplicity concern while preserving the reported statistics. revision: yes
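For readers unfamiliar with permutation checks of this kind, the sketch below shows a generic Monte Carlo permutation test for top-k overlap between two head rankings. It is not the paper's P_V4 procedure, whose details are not given on this page; the function name `overlap_permutation_p`, the synthetic scores, and the choice of shuffling one signal across heads are all assumptions made for illustration.

```python
import numpy as np

def overlap_permutation_p(probe_scores, ablation_impact, k, n_perm=10_000, seed=0):
    """Estimate how often a random relabelling of heads gives at least the
    observed top-k overlap between the two signals."""
    rng = np.random.default_rng(seed)
    probe = np.asarray(probe_scores, dtype=float)
    impact = np.asarray(ablation_impact, dtype=float)
    top_probe = set(np.argsort(-probe)[:k].tolist())

    def overlap(values):
        return len(top_probe & set(np.argsort(-values)[:k].tolist()))

    observed = overlap(impact)
    # Shuffling impact across heads breaks any real head-level association.
    hits = sum(overlap(rng.permutation(impact)) >= observed for _ in range(n_perm))
    # Add-one correction keeps the estimated p-value strictly positive.
    return observed, (hits + 1) / (n_perm + 1)

# Hypothetical synthetic scores for 192 heads; not data from the paper.
rng = np.random.default_rng(1)
probe = rng.normal(size=192)
impact = probe + rng.normal(scale=0.5, size=192)  # deliberately correlated signals

observed, p_value = overlap_permutation_p(probe, impact, k=38, n_perm=2000)
print(observed, p_value)
```

A multiplicity-aware variant would additionally repeat the selection steps (task subsets, thresholds) inside each permutation, which is the concern the referee raises; this sketch covers only the basic overlap statistic.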

Circularity Check

0 steps flagged

No circularity in empirical head-importance identification

The paper's claims rest on empirical measurements from external probes (TxtCopy on 95 sentences) and interventions (per-head ablations on binary copy, associative recall, Rule 90, binary addition), followed by a standard hypergeometric test for overlap significance (N=192, K=38, n=4, P=0.0013) and permutation checks. No mathematical derivation, parameter fitting, or prediction step reduces by construction to its own inputs. The central result is an observed coincidence plus causal specificity (3.2x on L26.28 ablation), supported by honest negatives and controls. This matches the default expectation of self-contained empirical work with no load-bearing self-citation or self-definitional reduction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Empirical ML study; no free parameters, axioms, or invented entities are introduced or fitted in the reported results.
