Borrowed Geometry: Cross-Distribution Head-Importance Fingerprints of Frozen Pretrained Gemma 4 31B
Pith reviewed 2026-05-21 00:20 UTC · model grok-4.3
The pith
Frozen text-only Gemma model contains specific attention heads that rank highly important for both language probes and non-language pattern tasks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Within the L24-L29 slice of 192 heads, the four heads L26.28, L27.28, L27.2 and L27.3 rank as top-tier on both the TxtCopy attention probe and per-head ablation impact for the four non-language tasks; their slice-level coincidence reaches P = 0.0013 under the hypergeometric null and survives multiplicity-aware permutation testing at P_V4 = 0.013. Head-level causal ablation of L26.28 drops success on the cube-double-play task from 63.3 % to 10.0 %, a 3.2 times larger effect than a low-TxtCopy negative control.
What carries the argument
Cross-distribution head-importance fingerprint formed by joint top-tier ranking of heads on a text attention probe and on ablation impact across non-language token tasks.
If this is right
- The frozen pretrained weights already contain structure that supports non-text pattern tasks once a thin interface is trained.
- Pretrained Gemma reaches 60 % on the cube task while random-initialized controls remain near 1 %.
- Zeroing L26.28 produces a larger performance drop than zeroing a layer-matched low-TxtCopy head, supplying head-level causal evidence.
- Some tasks such as Walker2d recruit heads outside the L24-L29 slice and show weaker ablation specificity.
Where Pith is reading between the lines
- The same heads may implement reusable pattern-matching operations that pretraining discovers even when the training distribution is purely linguistic.
- Systematic head-ablation mapping could serve as a lightweight diagnostic for which pretrained components transfer to new modalities without full fine-tuning.
- Extending the probe set to additional sequence or grid tasks would test whether the four-head coincidence is stable or task-dependent.
Load-bearing premise
The TxtCopy probe together with the four chosen non-language tasks serve as representative proxies for head importance that generalizes across distributions.
What would settle it
Re-running the joint-ranking analysis with a different text probe or a new set of non-language tasks that fails to recover the same four heads at comparable significance levels would falsify the claimed cross-distribution fingerprint.
Figures
read the original abstract
Frozen Gemma 4 31B weights pretrained exclusively on text, unmodified, transfer through a thin trainable interface to non-text modalities the substrate has never processed. On the L24--L29 slice (192 attention heads), an English-text TxtCopy attention probe (95 sentences) and per-head ablation impact on four non-language token-pattern tasks (binary copy, associative recall, 1D cellular automaton Rule 90, binary addition) jointly classify four heads -- L26.28, L27.28, L27.2, L27.3 -- as top-tier on both signals. The slice-level joint coincidence is significant under hypergeometric null ($P = 0.0013$, $N=192$, $K=38$, $n=4$) and survives multiplicity-aware permutation tests ($P_{V4} = 0.013$). Pretrained Gemma L26 reaches 60.22% on OGBench cube-double-play-task1 vs ~1% for random-init Gemma ($+59$pt at $n=3$); a FrozenRandom-GPT2 control with correct $1/\sqrt{d_k}$ scaling also fails. Head-level causal validation: zeroing L26.28 in the trained cube-task1 IQL agent drops success $63.3\% \to 10.0\%$ vs $46.7\%$ for a layer-matched low-TxtCopy negative control ($3.2\times$ specificity at $n=30$; $n=5$ paired-$t$ $p=0.039$). A full L26 sweep places L26.28 at rank 4 of 32. Honest negatives: within-L26 Spearman $\rho(\text{TxtCopy, drop}) = +0.37$ (opposite of within-layer causal reading); single-head activation patching does not transfer the matching variable; the 4 named heads alone do not suffice on any task; Walker2d-DT and scene-task1 recruit L24 outside the named slice and show null head-ablation specificity. We frame the contribution as a cross-distribution importance fingerprint at the slice level plus head-level causal evidence on one cross-modality target.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript claims that four attention heads (L26.28, L27.28, L27.2, L27.3) in the L24--L29 slice of frozen pretrained Gemma 4 31B are jointly top-tier on an English TxtCopy probe and four non-language tasks (binary copy, associative recall, Rule 90, binary addition). The slice-level overlap is statistically significant under hypergeometric null (P=0.0013, N=192, K=38, n=4) and multiplicity-aware permutation tests (P_V4=0.013). Causal ablation of L26.28 on a cube task shows 3.2x specificity (63.3% to 10.0% drop vs. 46.7% for control, paired-t p=0.039 at n=5), with pretrained model outperforming random-init and FrozenRandom-GPT2 controls. The work reports honest negatives on other tasks and frames the result as a cross-distribution head-importance fingerprint.
Significance. If the non-language tasks probe computational demands distinct from text, the result would identify reusable attention heads in text-pretrained transformers that support transfer to other modalities without weight modification. Strengths include the use of hypergeometric and permutation tests, head-level causal ablation with controls, outperformance over random baselines, and explicit reporting of negative results on other tasks and within-layer correlations. This could inform modular transfer learning and the search for general computational primitives in large models.
major comments (2)
- [Abstract] Abstract: The cross-distribution fingerprint claim rests on the four non-language tasks (binary copy, associative recall, Rule 90, binary addition) probing demands distinct from the TxtCopy text probe. These tasks are all discrete sequential token-manipulation problems that structurally resemble sentence copying, so the observed head overlap and ablation effects (e.g., L26.28) may reflect shared sequential attention mechanisms rather than borrowed geometry across truly different distributions. Explicit justification or additional tasks from continuous or non-sequential modalities is required to support the central claim.
- [Abstract] Abstract: The four heads are identified post-hoc as the joint top performers on the TxtCopy probe and the selected non-language tasks. While the hypergeometric test reports P=0.0013, the data-dependent choice of both the heads and the task set may require a pre-specified analysis plan or adjusted multiplicity correction to confirm that the significance is not inflated by selection.
minor comments (1)
- [Abstract] Abstract: The layer-head notation (L26.28 etc.) should include a brief definition or pointer to the model architecture section to aid readers unfamiliar with Gemma's indexing convention.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. We address each major comment below, providing the strongest honest defense of our claims while indicating revisions where the manuscript can be strengthened without misrepresentation.
read point-by-point responses
-
Referee: [Abstract] Abstract: The cross-distribution fingerprint claim rests on the four non-language tasks (binary copy, associative recall, Rule 90, binary addition) probing demands distinct from the TxtCopy text probe. These tasks are all discrete sequential token-manipulation problems that structurally resemble sentence copying, so the observed head overlap and ablation effects (e.g., L26.28) may reflect shared sequential attention mechanisms rather than borrowed geometry across truly different distributions. Explicit justification or additional tasks from continuous or non-sequential modalities is required to support the central claim.
Authors: We agree that all tasks involve sequential token manipulation and thus share some structural features with sentence copying. However, the non-language tasks target distinct computational primitives not reducible to generic sequential attention: binary addition requires carry propagation and positional arithmetic absent from text copying; Rule 90 implements a specific local neighborhood transition rule from cellular automata theory; associative recall tests binding and retrieval without semantic or syntactic structure. These differences support interpreting the overlap as evidence of reusable heads for non-text distributions. We will revise the abstract and add a dedicated paragraph in the discussion to explicitly justify the task choices by contrasting their computational demands, while acknowledging the limitation that the current set remains discrete and sequential. We will also note planned extensions to continuous or non-sequential modalities in future work. revision: partial
-
Referee: [Abstract] Abstract: The four heads are identified post-hoc as the joint top performers on the TxtCopy probe and the selected non-language tasks. While the hypergeometric test reports P=0.0013, the data-dependent choice of both the heads and the task set may require a pre-specified analysis plan or adjusted multiplicity correction to confirm that the significance is not inflated by selection.
Authors: The four heads were selected based on joint top-tier performance, introducing a data-dependent element. The hypergeometric test evaluates overlap significance under a fixed null of random head selection within the slice, and we supplemented it with multiplicity-aware permutation tests (P_V4 = 0.013) that explicitly simulate the selection process over task sets and head rankings. The L24--L29 slice itself was chosen a priori from preliminary layer-wise importance scans. In revision we will expand the methods section to document the full analysis pipeline with upfront criteria for slice selection, task inclusion, and head ranking, and we will add a sensitivity analysis showing that the overlap remains significant under alternative task subsets. This addresses the multiplicity concern while preserving the reported statistics. revision: yes
Circularity Check
No circularity in empirical head-importance identification
full rationale
The paper's claims rest on empirical measurements from external probes (TxtCopy on 95 sentences) and interventions (per-head ablations on binary copy, associative recall, Rule 90, binary addition), followed by a standard hypergeometric test for overlap significance (N=192, K=38, n=4, P=0.0013) and permutation checks. No mathematical derivation, parameter fitting, or prediction step reduces by construction to its own inputs. The central result is an observed coincidence plus causal specificity (3.2x on L26.28 ablation), supported by honest negatives and controls. This matches the default expectation of self-contained empirical work with no load-bearing self-citation or self-definitional reduction.
Axiom & Free-Parameter Ledger
Reference graph
Works this paper leans on
-
[1]
$\pi_0$: A Vision-Language-Action Flow Model for General Robot Control
Kevin Black, Noah Brown, Danny Driess, Adnan Esmail, Michael Equi, Chelsea Finn, Niccolo Fusai, Lachy Groom, Karol Hausman, Brian Ichter, et al. 𝜋0: A Vision-Language-Action Flow Model for General Robot Control. arXiv:2410.24164,
work page internal anchor Pith review Pith/arXiv arXiv
-
[2]
Danny Driess, Jost Tobias Springenberg, Brian Ichter, Lili Yu, Adrian Li-Bell, Karl Pertsch, Allen Z. Ren, Homer Walke, Quan Vuong, LucyXiaoyang Shi, and Sergey Levine. Knowledge Insulating Vision-Language-Action Models: Train Fast, Run Fast, Generalize Better. arXiv:2505.23705,
-
[3]
D4RL: Datasets for Deep Data-Driven Reinforcement Learning
Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4rl: Datasets for deep data-driven reinforcement learning. InarXiv preprint arXiv:2004.07219,
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[4]
Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines.arXiv preprint arXiv:1410.5401,
work page internal anchor Pith review Pith/arXiv arXiv
-
[5]
The Platonic Representation Hypothesis
arXiv:2405.07987. Herbert Jaeger. The echo state approach to analysing and training recurrent neural networks.GMD Report 148, German National Research Center for Information Technology,
work page internal anchor Pith review Pith/arXiv arXiv
-
[6]
OpenVLA: An Open-Source Vision-Language-Action Model
Moo Jin Kim, Karl Pertsch, Siddharth Karamcheti, Ted Xiao, Ashwin Balakrishna, Suraj Nair, Rafael Rafailov, Ethan Foster, Grace Lam, Pannag Sanketi, Quan Vuong, Thomas Kollar, Benjamin Burchfiel, Russ Tedrake, Dorsa Sadigh, Sergey Levine, Percy Liang, and Chelsea Finn. OpenVLA: An Open-Source Vision-Language-Action Model. arXiv:2406.09246,
work page internal anchor Pith review Pith/arXiv arXiv
-
[7]
Why does deep and cheap learning work so well?
arXiv:1608.08225 (2016). 25 Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines. InAAAI Conference on Artificial Intelligence,
work page internal anchor Pith review Pith/arXiv arXiv 2016
-
[8]
Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models
arXiv:2403.19647. PankajMehtaandDavidJ.Schwab. Anexactmappingbetweenthevariationalrenormalizationgroupanddeeplearning. arXiv preprint arXiv:1410.3831,
work page internal anchor Pith review Pith/arXiv arXiv
-
[9]
Aakanksha Naik and Vishwa Gupta. Adapting pretrained transformers for tasks outside their training distribution.arXiv preprint arXiv:2108.05247,
-
[10]
arXiv preprint arXiv:2410.20092 , year=
arXiv:2410.20092. Physical Intelligence.𝜋 ∗ 0.6: a VLA That Learns From Experience. arXiv:2511.14759,
-
[11]
Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small
URL https: //transformer-circuits.pub/2024/scaling-monosemanticity/. Kevin Wang, Alexandre Variengien, Arthur Conmy, Buck Shlegeris, and Jacob Steinhardt. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small. arXiv:2211.00593,
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[12]
arXiv:2307.15818. 26
work page internal anchor Pith review Pith/arXiv arXiv
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.