pith. machine review for the scientific record.

arxiv: 2605.00333 · v1 · submitted 2026-05-01 · 💻 cs.LG · cs.CL

Recognition: unknown

Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities

Authors on Pith: no claims yet

Pith reviewed 2026-05-09 20:27 UTC · model grok-4.3

classification 💻 cs.LG cs.CL
keywords cross-modal transfer · frozen transformer weights · associative recall · robotic manipulation · parameter efficiency · attention head identification · pretrained model reuse

The pith

Frozen text-pretrained transformer weights transfer to robotic and memory tasks through a thin trainable interface without modifying the core model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper aims to establish that weights from a large transformer trained exclusively on text can remain frozen while still processing inputs from other domains such as robot actions or sequence recall. A small trainable interface converts the new inputs into a form the frozen model can handle, after which its existing computations run unchanged. This matters to a reader because it offers a route to reuse the results of costly text pretraining instead of repeating similar work for every new type of data. The experiments show gains on a robotic manipulation benchmark, parity on a locomotion task with reduced trainable parameters, and a decisive edge on an associative recall problem where an equivalent from-scratch model fails. The work further isolates particular attention heads that matter for both language probes and the new tasks.
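As a concrete picture of the wiring this describes, here is a minimal sketch: a frozen pretrained body with a small trainable encoder and readout around it. The module names, dimensions, and the plain linear interface are illustrative assumptions, not the paper's actual code.

```python
import torch.nn as nn

class ThinInterfaceWrapper(nn.Module):
    """Minimal sketch: a frozen sequence model reused for a non-text task.

    `frozen_body` stands in for the pretrained transformer (or a slice of it);
    only the input/output projections below are trained.
    """

    def __init__(self, frozen_body: nn.Module, in_dim: int, d_model: int, out_dim: int):
        super().__init__()
        self.frozen_body = frozen_body
        for p in self.frozen_body.parameters():
            p.requires_grad = False          # keep the pretrained geometry untouched

        # The "thin trainable interface": maps task inputs into the model's
        # embedding space and reads predictions back out.
        self.encode = nn.Linear(in_dim, d_model)
        self.decode = nn.Linear(d_model, out_dim)

    def forward(self, x):                     # x: (batch, seq, in_dim)
        h = self.encode(x)                    # project into the frozen embedding space
        h = self.frozen_body(h)               # frozen computation, run unchanged
        return self.decode(h)                 # task-specific readout
```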

Core claim

The central claim is that unmodified weights from a text-only pretrained Gemma 4 31B model function as a reusable substrate for non-text modalities once paired with a thin trainable interface. On OGBench scene-play tasks the setup exceeds published baselines. On D4RL Walker2d it reaches decision-transformer performance while using 0.43 times the trainable parameters by operating on a compressed 5-layer slice. On associative recall the frozen slice plus a 113K-parameter interface solves the task, whereas a from-scratch transformer at matched capacity cannot.
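The "compressed 5-layer slice" suggests extracting a contiguous run of pretrained blocks and freezing it. A hedged sketch, assuming transformer blocks that take and return a single hidden-state tensor (real checkpoints usually also need attention masks and position inputs threaded through, and the attribute path to the block list depends on the implementation):

```python
import torch.nn as nn

def frozen_slice(blocks: nn.ModuleList, start: int, stop: int) -> nn.Sequential:
    """Take a contiguous run of pretrained transformer blocks (for example a
    slice like L25-L29), freeze every parameter, and return it as a
    standalone substrate."""
    piece = nn.Sequential(*list(blocks[start:stop + 1]))
    for p in piece.parameters():
        p.requires_grad = False   # the substrate is never updated
    return piece
```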

What carries the argument

The thin trainable interface that maps non-text inputs into the embedding space of the frozen text-pretrained transformer, after which the model's internal geometry processes the signals as usual.

Load-bearing premise

The geometry formed during text-only pretraining remains useful for processing inputs from entirely different domains without any changes to the frozen weights.

What would settle it

An experiment in which a randomly initialized transformer with the same architecture and scaling solves the robotic or associative-recall tasks at the same level as the pretrained frozen version, or in which a from-scratch model of matched capacity succeeds where the frozen setup does not.
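One way to run that control, sketched under the assumption of a PyTorch-style frozen slice; the re-initialization scheme shown is a generic choice, not necessarily the paper's.

```python
import copy
import torch.nn as nn

def make_substrates(pretrained_slice: nn.Module):
    """Sketch of the decisive control: identical architecture, different weights.

    Both substrates stay frozen; only the thin interface (built elsewhere) is
    trained on top of each. If the randomly re-initialized copy matches the
    pretrained one, architecture alone explains the transfer; if it stays at
    chance, the pretrained geometry is doing the work.
    """
    random_slice = copy.deepcopy(pretrained_slice)

    def reinit(m):
        # Re-draw weights with a standard scheme; an illustrative choice,
        # not the paper's exact initializer.
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, std=0.02)
            if m.bias is not None:
                nn.init.zeros_(m.bias)

    random_slice.apply(reinit)

    for substrate in (pretrained_slice, random_slice):
        for p in substrate.parameters():
            p.requires_grad = False
    return pretrained_slice, random_slice
```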

Figures

Figures reproduced from arXiv: 2605.00333 by Abay Bektursun.

Figure 13. Cube-task1 substrate isolation training curves.
Figure 14. OGBench cube-double-play-task1 layer ablation.
Figure 6. Walker2d n = 3 result. Gemma-DT D4RL normalized score: per-seed best-checkpoint trajectories (s42, s1337, s2024). All three seeds reach DT-parity; peak iterations are dispersed across the training budget. Reference lines: BC (63.9), DT 1.2M (74.0), IQL (78.3). Trainable parameters: 521K (vs DT 1.2M). Frozen-substrate compression. A layer-drop sweep at n = 3 shows that dropping L24 (5L slice L25–L29, 2.45B …
Figure 3. FrozenRandom-GPT2 architecture-alone control.
Figure 1. Computational exaptation. Each point is one (layer, head) × (task) critical ablation. X-axis: head's TxtCopy probe score on 95 English sentences, as a ratio to the L24–L29 slice mean. Y-axis: head's task-ablation impact Δ (per-bit error increase when the head is zeroed) on one of four non-language tasks. Stars mark the four named heads. L26.28 scores 3.7× slice baseline on English text (4th of 192 heads) a…
Figure 4. Dyck-2 plateau on frozen Gemma; cracked by matched-capacity trained transformer.
Figure 12. Gemma 4 31B occupies a Pareto frontier in performance vs scale.
Figure 2. Length-OOD profile on CA Rule 90. Left: best-checkpoint per-bit error vs. sequence length for seven controls, trained on lengths 3–20 inclusive. Green band marks the training-edge length L20. Trained-Transformer is a matched-capacity from-scratch trained transformer (6.36M, 1/√d_k scaling, n = 2 seeds, run at its best lr=1e-4); other controls at canonical lr=3e-4. Right: per-bit error ratio of each altern…
Figure 7. Sparse, task-specific criticality. Head-ablation impact Δ per task, normalized by task-maximum. Each panel is 6 layers × 32 heads = 192 heads. Different tasks select different dominant heads on the same physical layer L27 — L27.28 for copy and addition, L27.2 for AR, L27.3 for CA R90. C.2: 47% of critical pairs have a single dominant language function. Of 141 critical pairs at threshold, 67 (47.5%) classify…
Figure 5. Distillation circuit-prediction. AR per-bit error vs. length. DS-PROC multi-hint at s42 closes to within 24% of online Gemma (1.24× ratio at L30) at zero Gemma inference cost. E.3 Other tasks — circuit-layout prediction extends. Copy (localized at L27.28): DS-PROC single-hint L30 = 0.030 (1.2× over Student-only); multi-hint does not improve. CA R90: DS-PROC multi-hint L30 = 0.138 vs Student-only 0.171 (−19…
Original abstract

Frozen Gemma 4 31B weights pretrained exclusively on text tokens, unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: $+4.33$pt over published GCIQL at $n=3$ with std 0.74 -- a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision-Transformer parity ($76.2 \pm 0.8$, $n=3$) at $0.43\times$ DT's trainable count, with the frozen substrate compressing to a 5L slice ($+1.66$pt over the 6L baseline at $n=3$). (3) Associative recall as the cleanest pretraining-load-bearing case: the frozen slice + a 113K-parameter linear interface reaches L30 best-checkpoint per-bit error 0.0505 ($n=2$); a 6.36M-parameter from-scratch trained transformer at matched capacity ($1/\sqrt{d_k}$ scaling, two seeds, LR sweep) cannot solve the task at all under the protocol (best L30 = 0.4395), an $8.7\times$ advantage. Architecture-alone falsifications: a frozen random transformer with correct $1/\sqrt{d_k}$ scaling stays at random-chance loss for 50k steps; a random-init Gemma slice fails OGBench cube-double-play-task1 entirely (0.89% across $n=3$ where pretrained reaches 60%). A dual-measurement protocol -- text-activation probing on 95 English sentences plus task-ablation on a non-language target -- names individual heads independently identifiable on both protocols: head L26.28 scores $3.7\times$ the slice mean for English token-copying and is the #2 most-critical head for binary copy ablation ($\Delta$ L30 $= +0.221$); three further heads (L27.28, L27.2, L27.3) classify by the same protocol. The mechanism is single-model and the cross-modality results are single-task within their respective benchmarks; cross-model replication is structurally constrained because Gemma 4 31B is the only model on the small-scale Pareto frontier as of April 2026.
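The abstract's task-ablation measurement (the per-bit error increase Δ when a single head is zeroed) can be pictured with a forward hook. A sketch, assuming the hooked module's output is the concatenation of per-head values before the output projection; actual layouts differ across implementations, and the `per_bit_error` evaluator is a placeholder, so this is illustrative rather than the paper's exact protocol.

```python
import torch

def zero_head_hook(head_idx: int, head_dim: int):
    """Return a forward hook that zeroes one head's slice of the hooked output.

    Assumes the output's last dimension is n_heads * head_dim, laid out
    head-by-head, before the output projection mixes heads together.
    """
    def hook(module, inputs, output):
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[..., head_idx * head_dim:(head_idx + 1) * head_dim] = 0.0
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden
    return hook

@torch.no_grad()
def head_ablation_delta(per_bit_error, model, attn_module, head_idx, head_dim):
    """Delta = per-bit error with the head zeroed, minus the unablated baseline."""
    baseline = per_bit_error(model)
    handle = attn_module.register_forward_hook(zero_head_hook(head_idx, head_dim))
    try:
        ablated = per_bit_error(model)
    finally:
        handle.remove()
    return ablated - baseline
```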

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated authors' rebuttal, circularity audit, and an axiom ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 3 minor

Summary. The paper claims that unmodified frozen weights from a text-pretrained transformer (Gemma 4 31B) transfer to non-text modalities via a thin trainable interface. It reports a +4.33pt gain over GCIQL on OGBench scene-play-singletask (n=3), Decision-Transformer parity on D4RL Walker2d-medium-v2 at 0.43x trainable parameters using a 5-layer slice, and an 8.7x error reduction on associative recall (0.0505 vs 0.4395) versus a matched-capacity from-scratch transformer. Controls include random-weight baselines, random-init Gemma slices, and head-level dual-protocol ablations linking text token-copying to task-critical heads.

Significance. If the results hold, the work provides evidence that text-only pretraining encodes reusable geometric structures transferable across modalities without core weight modification, supporting more efficient cross-modal reuse. Strengths include multiple internal falsification controls (random transformers at chance, from-scratch failure, architecture ablations) and head-specific measurements that tie English probing to non-language task performance. These elements make the transfer claim more falsifiable than typical empirical transfer papers.

major comments (2)
  1. [OGBench experiments] OGBench results paragraph: the +4.33pt gain over published GCIQL is reported with n=3 and std=0.74; without a statistical test (e.g., paired t-test or bootstrap CI) this does not yet establish a reliable SOTA win, as the interval overlaps plausible noise (a minimal bootstrap sketch follows the comment lists below).
  2. [Associative recall] Associative recall section: the from-scratch baseline is stated as 'at matched capacity' with 1/√d_k scaling and LR sweep, yet the frozen 5-layer slice is extracted from a 31B model while the interface is only 113K parameters; clarify the precise capacity metric used for matching beyond trainable count.
minor comments (3)
  1. [Abstract] Abstract: the model is referred to as 'Gemma 4 31B'; the main text should state the exact checkpoint identifier and release date for reproducibility.
  2. [Head-level analysis] Head ablation protocol: the dual-measurement (English token-copying + task ablation) identifies heads such as L26.28; a supplementary table listing all heads' scores on both protocols would improve clarity.
  3. [D4RL experiments] D4RL paragraph: the 5L slice is reported as +1.66pt over a 6L baseline; confirm whether the 6L baseline also uses frozen weights or is fully trainable.
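Regarding major comment 1, a percentile bootstrap over the per-seed scores is straightforward to report. The sketch below assumes the three seed-level OGBench scores are available (they are not listed in this summary), so the inputs are placeholders.

```python
import numpy as np

def bootstrap_mean_delta_ci(seed_scores, baseline, n_boot=10_000, alpha=0.05, rng_seed=0):
    """Percentile bootstrap CI for mean(seed_scores) - baseline.

    With only n=3 seeds the resampling distribution is coarse, which is
    exactly why the referee asks for the interval to be reported explicitly.
    """
    rng = np.random.default_rng(rng_seed)
    scores = np.asarray(seed_scores, dtype=float)
    deltas = np.empty(n_boot)
    for i in range(n_boot):
        deltas[i] = rng.choice(scores, size=scores.size, replace=True).mean() - baseline
    lo, hi = np.percentile(deltas, [100 * alpha / 2, 100 * (1 - alpha / 2)])
    return lo, hi

# Placeholder usage: the per-seed scores and the published GCIQL score would
# come from the paper's evaluation logs, not from this review.
# lo, hi = bootstrap_mean_delta_ci([s1, s2, s3], baseline=gciql_score)
```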

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below.

Point-by-point responses
  1. Referee: [OGBench experiments] OGBench results paragraph: the +4.33pt gain over published GCIQL is reported with n=3 and std=0.74; without a statistical test (e.g., paired t-test or bootstrap CI) this does not yet establish a reliable SOTA win, as the interval overlaps plausible noise.

    Authors: We agree that a formal statistical test is needed to substantiate the SOTA claim. In the revised manuscript we will add a bootstrap confidence interval (computed over the n=3 runs) for the OGBench scene-play-singletask comparison. Given the reported mean difference of +4.33 and standard deviation of 0.74, the interval is expected to exclude zero, but we will report the exact CI and p-value so readers can assess reliability directly. revision: yes

  2. Referee: [Associative recall] Associative recall section: the from-scratch baseline is stated as 'at matched capacity' with 1/√d_k scaling and LR sweep, yet the frozen 5-layer slice is extracted from a 31B model while the interface is only 113K parameters; clarify the precise capacity metric used for matching beyond trainable count.

    Authors: We will revise the text to state explicitly that capacity matching is performed on (i) the number of trainable parameters (113 K interface vs. 6.36 M full from-scratch transformer) and (ii) the architectural dimensions of the trainable component (6-layer transformer whose d_model and d_k match those of the 5-layer Gemma slice). The 1/√d_k initialization and LR sweep were applied only to the from-scratch model. We acknowledge that the frozen 31 B weights supply additional representational capacity unavailable to the from-scratch baseline; the experiment is intentionally designed to isolate the value of that pre-trained geometry under a minimal trainable interface. The revised paragraph will make this distinction clear while preserving the reported result that the from-scratch model fails to solve the task. revision: yes
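To make the promised capacity-matching statement concrete, the two reported quantities could be checked mechanically. A sketch, with the 113K / 6.36M figures taken from the paper and everything else (names, signatures, dimension handling) illustrative:

```python
import torch.nn as nn

def trainable_params(module: nn.Module) -> int:
    # Frozen substrate weights have requires_grad=False, so only the
    # interface (or the from-scratch model) contributes to this count.
    return sum(p.numel() for p in module.parameters() if p.requires_grad)

def report_capacity_match(interface: nn.Module, scratch_model: nn.Module,
                          slice_dims: tuple[int, int], scratch_dims: tuple[int, int]) -> None:
    """Print the two matching criteria promised in the revision: trainable
    budgets side by side, and (d_model, d_k) of each trainable stack."""
    print(f"interface trainable params:    {trainable_params(interface):,}")     # ~113K reported
    print(f"from-scratch trainable params: {trainable_params(scratch_model):,}")  # ~6.36M reported
    print(f"frozen Gemma slice (d_model, d_k): {slice_dims}")
    print(f"from-scratch model (d_model, d_k): {scratch_dims}")
```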

Circularity Check

0 steps flagged

No significant circularity identified

full rationale

The paper's central claims rest on empirical performance comparisons across modalities using frozen text-pretrained weights, with explicit falsification controls (random-init Gemma slice, frozen random transformer at chance, from-scratch transformer at matched capacity failing associative recall, and head-level dual-protocol ablations). No mathematical derivations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The transferability assumption is addressed through direct internal evidence rather than untested premises or renamed known results. The work is self-contained against the provided benchmarks and controls.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Abstract provides no explicit free parameters or invented entities; the central claim rests on one domain assumption about transferable geometry.

axioms (1)
  • domain assumption: Representations learned from text-only pretraining contain geometry that is useful for non-text modalities without core weight changes.
    Invoked directly in the claim that unmodified frozen weights transfer via a thin interface.

pith-pipeline@v0.9.0 · 5751 in / 1102 out tokens · 23931 ms · 2026-05-09T20:27:37.081861+00:00 · methodology


Reference graph

Works this paper leans on

20 extracted references · 7 canonical work pages · 2 internal anchors

  1. [1]

    Decision transformer: Reinforcement learning via sequence modeling

    Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, 2021

  2. [2]

    A mathematical framework for transformer circuits

    Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021

  3. [3]

    Toy models of superposition

    Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, et al. Toy models of superposition. Transformer Circuits Thread, 2022

  4. [4]

    The lottery ticket hypothesis: Finding sparse, trainable neural networks

    Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019

  5. [5]

    D4RL: Datasets for Deep Data-Driven Reinforcement Learning

    Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020

  6. [6]

    Gemma 4: Frontier multimodal intelligence on device

    Gemma Team. Gemma 4: Frontier multimodal intelligence on device. Google DeepMind. https://deepmind.google/models/gemma/gemma-4/, 2026. Released April 2026. Open weights under Apache 2.0

  7. [7]

    Exaptation---a missing term in the science of form

    Stephen Jay Gould and Elisabeth S. Vrba. Exaptation---a missing term in the science of form. Paleobiology, 8(1): 4--15, 1982

  8. [8]

    Neural Turing Machines

    Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014

  9. [9]

    The Platonic Representation Hypothesis

    Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In International Conference on Machine Learning (ICML), 2024. arXiv:2405.07987

  10. [10]

    The echo state approach to analysing and training recurrent neural networks

    Herbert Jaeger. The echo state approach to analysing and training recurrent neural networks. GMD Report 148, German National Research Center for Information Technology, 2001

  11. [11]

    Offline reinforcement learning with implicit q-learning

    Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit q-learning. In International Conference on Learning Representations, 2022

  12. [12]

    Conservative q-learning for offline reinforcement learning

    Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020

  13. [13]

    Why does deep and cheap learning work so well?

    Henry W. Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168: 1223--1247, 2017. arXiv:1608.08225 (2016)

  14. [14]

    Pretrained transformers as universal computation engines

    Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines. In AAAI Conference on Artificial Intelligence, 2022

  15. [15]

    Real-time computing without stable states: A new framework for neural computation based on perturbations

    Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11): 2531--2560, 2002

  16. [16]

    An exact mapping between the variational renormalization group and deep learning

    Pankaj Mehta and David J. Schwab. An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831, 2014

  17. [17]

    Adapting pretrained transformers for tasks outside their training distribution

    Aakanksha Naik and Vishwa Gupta. Adapting pretrained transformers for tasks outside their training distribution. arXiv preprint arXiv:2108.05247, 2021

  18. [18]

    In-context learning and induction heads

    Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022

  19. [19]

    OGBench: Benchmarking offline goal-conditioned RL

    Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. Ogbench: Benchmarking offline goal-conditioned rl. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.20092

  20. [20]

    Parallel Distributed Processing: Explorations in the Microstructure of Cognition

    David E. Rumelhart, James L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986