Borrowed Geometry: Computational Reuse of Frozen Text-Pretrained Transformer Weights Across Modalities
Pith reviewed 2026-05-09 20:27 UTC · model grok-4.3
The pith
Frozen text-pretrained transformer weights transfer to robotic and memory tasks through a thin trainable interface without modifying the core model.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The central claim is that unmodified weights from a text-only pretrained Gemma 4 31B model function as a reusable substrate for non-text modalities once paired with a thin trainable interface. On OGBench scene-play tasks the setup exceeds the published GCIQL baseline. On D4RL Walker2d it reaches Decision-Transformer performance with 0.43x the trainable parameters, operating on a compressed 5-layer slice of the frozen model. On associative recall the frozen slice plus a 113K-parameter interface solves the task, whereas a from-scratch transformer at matched trainable capacity cannot.
What carries the argument
A thin trainable interface maps non-text inputs into the embedding space of the frozen text-pretrained transformer; from there the model's internal geometry processes the signals as it would text.
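In code, the pattern is compact. The PyTorch sketch below is illustrative rather than the paper's implementation: the class name, dimensions, and the assumption that the frozen slice behaves as a standard `nn.Module` are ours.

```python
import torch
import torch.nn as nn

class FrozenSliceWithInterface(nn.Module):
    """Thin trainable interface around a frozen text-pretrained slice (sketch)."""
    def __init__(self, frozen_slice: nn.Module, in_dim: int, d_model: int, out_dim: int):
        super().__init__()
        self.frozen_slice = frozen_slice
        for p in self.frozen_slice.parameters():
            p.requires_grad = False                 # core weights stay untouched
        self.encode = nn.Linear(in_dim, d_model)    # trainable input interface
        self.decode = nn.Linear(d_model, out_dim)   # trainable readout

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.encode(x)        # map non-text inputs into the embedding space
        h = self.frozen_slice(h)  # frozen text-pretrained geometry does the processing
        return self.decode(h)

# Gradients flow through the frozen slice but update only the two linear maps:
# optim = torch.optim.AdamW(
#     (p for p in model.parameters() if p.requires_grad), lr=1e-4)
```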
Load-bearing premise
The geometry formed during text-only pretraining remains useful for processing inputs from entirely different domains without any changes to the frozen weights.
What would settle it
An experiment in which a randomly initialized transformer with the same architecture and scaling solves the robotic or associative-recall tasks at the same level as the pretrained frozen version, or in which a from-scratch model of matched capacity succeeds where the frozen setup does not.
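The abstract reports running part of this control already (a frozen random transformer stuck at chance loss; a random-init Gemma slice failing OGBench). A minimal sketch of how such a control can be built, assuming a generic fan-in initialization as a stand-in for the paper's exact scaling scheme:

```python
import copy
import torch.nn as nn

def frozen_random_control(pretrained_slice: nn.Module) -> nn.Module:
    """Same architecture, freshly initialized, then frozen: the settling control."""
    control = copy.deepcopy(pretrained_slice)
    def reinit(m: nn.Module):
        if isinstance(m, nn.Linear):
            nn.init.normal_(m.weight, std=m.in_features ** -0.5)  # fan-in scaling (assumed)
            if m.bias is not None:
                nn.init.zeros_(m.bias)
    control.apply(reinit)
    for p in control.parameters():
        p.requires_grad = False   # train only the interface, exactly as before
    return control
```

If this control matches the pretrained substrate, architecture alone explains the results; if it stays at chance, the pretraining is load-bearing.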
Original abstract
Frozen Gemma 4 31B weights pretrained exclusively on text tokens, unmodified, transfer across modality boundaries through a thin trainable interface. (1) OGBench scene-play-singletask-task1-v0: $+4.33$pt over published GCIQL at $n=3$ with std 0.74 -- a published-SOTA win on a robotic manipulation task the substrate has never seen. (2) D4RL Walker2d-medium-v2: Decision-Transformer parity ($76.2 \pm 0.8$, $n=3$) at $0.43\times$ DT's trainable count, with the frozen substrate compressing to a 5L slice ($+1.66$pt over the 6L baseline at $n=3$). (3) Associative recall as the cleanest pretraining-load-bearing case: the frozen slice + a 113K-parameter linear interface reaches L30 best-checkpoint per-bit error 0.0505 ($n=2$); a 6.36M-parameter from-scratch trained transformer at matched capacity ($1/\sqrt{d_k}$ scaling, two seeds, LR sweep) cannot solve the task at all under the protocol (best L30 = 0.4395), an $8.7\times$ advantage. Architecture-alone falsifications: a frozen random transformer with correct $1/\sqrt{d_k}$ scaling stays at random-chance loss for 50k steps; a random-init Gemma slice fails OGBench cube-double-play-task1 entirely (0.89% across $n=3$ where pretrained reaches 60%). A dual-measurement protocol -- text-activation probing on 95 English sentences plus task-ablation on a non-language target -- names individual heads independently identifiable on both protocols: head L26.28 scores $3.7\times$ the slice mean for English token-copying and is the #2 most-critical head for binary copy ablation ($\Delta$ L30 $= +0.221$); three further heads (L27.28, L27.2, L27.3) classify by the same protocol. The mechanism is single-model and the cross-modality results are single-task within their respective benchmarks; cross-model replication is structurally constrained because Gemma 4 31B is the only model on the small-scale Pareto frontier as of April 2026.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that unmodified frozen weights from a text-pretrained transformer (Gemma 4 31B) transfer to non-text modalities via a thin trainable interface. It reports a +4.33pt gain over GCIQL on OGBench scene-play-singletask (n=3), Decision-Transformer parity on D4RL Walker2d-medium-v2 at 0.43x trainable parameters using a 5-layer slice, and an 8.7x error reduction on associative recall (0.0505 vs 0.4395) versus a matched-capacity from-scratch transformer. Controls include random-weight baselines, random-init Gemma slices, and head-level dual-protocol ablations linking text token-copying to task-critical heads.
Significance. If the results hold, the work provides evidence that text-only pretraining encodes reusable geometric structures transferable across modalities without core weight modification, supporting more efficient cross-modal reuse. Strengths include multiple internal falsification controls (random transformers at chance, from-scratch failure, architecture ablations) and head-specific measurements that tie English probing to non-language task performance. These elements make the transfer claim more falsifiable than typical empirical transfer papers.
Major comments (2)
- [OGBench experiments] OGBench results paragraph: the +4.33pt gain over published GCIQL is reported with n=3 and std=0.74; without a statistical test (e.g., paired t-test or bootstrap CI) this does not yet establish a reliable SOTA win, as the interval overlaps plausible noise.
- [Associative recall] Associative recall section: the from-scratch baseline is stated as 'at matched capacity' with 1/√d_k scaling and LR sweep, yet the frozen 5-layer slice is extracted from a 31B model while the interface is only 113K parameters; clarify the precise capacity metric used for matching beyond trainable count.
Minor comments (3)
- [Abstract] Abstract: the model is referred to as 'Gemma 4 31B'; the main text should state the exact checkpoint identifier and release date for reproducibility.
- [Head-level analysis] Head ablation protocol: the dual measurement (English token-copying + task ablation) identifies heads such as L26.28; a supplementary table listing every head's scores on both protocols would improve clarity (a sketch of the ablation mechanics follows this list).
- [D4RL experiments] D4RL paragraph: the 5L slice is reported as +1.66pt over a 6L baseline; confirm whether the 6L baseline also uses frozen weights or is fully trainable.
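As flagged in the head-level comment above, the task-ablation half of the dual protocol reduces to zeroing one head's contribution and re-measuring the task metric. A hedged sketch, assuming the hooked module returns the pre-projection concatenation of head outputs; the module path and head width are placeholders:

```python
import torch

def ablate_head(model, layer: int, head: int, d_head: int):
    """Zero one head's slice of the concatenated attention output via a forward hook."""
    attn = model.layers[layer].attn            # assumed module path into the slice
    lo, hi = head * d_head, (head + 1) * d_head
    def hook(module, inputs, output):
        out = output.clone()                   # avoid in-place edits on the graph
        out[..., lo:hi] = 0.0                  # knock out this head's contribution
        return out
    return attn.register_forward_hook(hook)

# handle = ablate_head(model, layer=26, head=28, d_head=128)  # head L26.28
# delta = evaluate(model, task) - baseline                    # e.g. Δ L30 per-bit error
# handle.remove()
```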
Simulated Author's Rebuttal
We thank the referee for the constructive comments and the recommendation of minor revision. We address each major comment below.
Point-by-point responses
Referee: [OGBench experiments] OGBench results paragraph: the +4.33pt gain over published GCIQL is reported with n=3 and std=0.74; without a statistical test (e.g., paired t-test or bootstrap CI) this does not yet establish a reliable SOTA win, as the interval overlaps plausible noise.
Authors: We agree that a formal statistical test is needed to substantiate the SOTA claim. In the revised manuscript we will add a bootstrap confidence interval (computed over the n=3 runs) for the OGBench scene-play-singletask comparison. Given the reported mean difference of +4.33 and standard deviation of 0.74, the interval is expected to exclude zero, but we will report the exact CI and p-value so readers can assess reliability directly. Revision: yes.
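For concreteness, a minimal sketch of the percentile bootstrap the authors commit to, assuming the per-seed scores are available as arrays. With n=3 per group the resample space is tiny, so the resulting interval should be read as coarse:

```python
import numpy as np

def bootstrap_mean_diff_ci(a, b, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile-bootstrap CI for mean(a) - mean(b), resampling runs with replacement."""
    rng = np.random.default_rng(seed)
    a, b = np.asarray(a, float), np.asarray(b, float)
    diffs = np.array([
        rng.choice(a, a.size).mean() - rng.choice(b, b.size).mean()
        for _ in range(n_boot)
    ])
    return np.quantile(diffs, [alpha / 2, 1 - alpha / 2])

# lo, hi = bootstrap_mean_diff_ci(ours_scores, gciql_scores)  # three seed scores each
# The SOTA claim is supported at the 95% level only if lo > 0.
```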
Referee: [Associative recall] Associative recall section: the from-scratch baseline is stated as 'at matched capacity' with 1/√d_k scaling and LR sweep, yet the frozen 5-layer slice is extracted from a 31B model while the interface is only 113K parameters; clarify the precise capacity metric used for matching beyond trainable count.
Authors: We will revise the text to state explicitly that capacity matching is performed on (i) the number of trainable parameters (113K interface vs. 6.36M full from-scratch transformer) and (ii) the architectural dimensions of the trainable component (a 6-layer transformer whose d_model and d_k match those of the 5-layer Gemma slice). The 1/√d_k initialization and LR sweep were applied only to the from-scratch model. We acknowledge that the frozen 31B weights supply additional representational capacity unavailable to the from-scratch baseline; the experiment is intentionally designed to isolate the value of that pretrained geometry under a minimal trainable interface. The revised paragraph will make this distinction clear while preserving the reported result that the from-scratch model fails to solve the task. Revision: yes.
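A sketch of the bookkeeping behind criterion (i): capacity matching on trainable parameters means counting only tensors that receive gradients. The model names below are placeholders; the target counts come from the rebuttal.

```python
import torch.nn as nn

def trainable_params(model: nn.Module) -> int:
    """Count only parameters that receive gradients; frozen substrate weights are excluded."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# trainable_params(frozen_slice_plus_interface)  # ~113_000 (interface only)
# trainable_params(from_scratch_transformer)     # ~6_360_000
```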
Circularity Check
No significant circularity identified
Full rationale
The paper's central claims rest on empirical performance comparisons across modalities using frozen text-pretrained weights, with explicit falsification controls (random-init Gemma slice, frozen random transformer at chance, from-scratch transformer at matched capacity failing associative recall, and head-level dual-protocol ablations). No mathematical derivations, predictions, or first-principles results are presented that reduce by construction to fitted parameters, self-definitions, or self-citation chains. The transferability assumption is addressed through direct internal evidence rather than untested premises or renamed known results. The work is self-contained against the provided benchmarks and controls.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: Representations learned from text-only pretraining contain geometry that is useful for non-text modalities without core weight changes.
Reference graph
Works this paper leans on
- [1] Lili Chen, Kevin Lu, Aravind Rajeswaran, Kimin Lee, Aditya Grover, Misha Laskin, Pieter Abbeel, Aravind Srinivas, and Igor Mordatch. Decision transformer: Reinforcement learning via sequence modeling. In Advances in Neural Information Processing Systems, 2021.
- [2] Nelson Elhage, Neel Nanda, Catherine Olsson, Tom Henighan, Nicholas Joseph, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, Tom Conerly, et al. A mathematical framework for transformer circuits. Transformer Circuits Thread, 2021.
- [3] Nelson Elhage, Tristan Hume, Catherine Olsson, Nicholas Schiefer, Tom Henighan, et al. Toy models of superposition. Transformer Circuits Thread, 2022.
- [4] Jonathan Frankle and Michael Carbin. The lottery ticket hypothesis: Finding sparse, trainable neural networks. In International Conference on Learning Representations (ICLR), 2019.
- [5] Justin Fu, Aviral Kumar, Ofir Nachum, George Tucker, and Sergey Levine. D4RL: Datasets for deep data-driven reinforcement learning. arXiv preprint arXiv:2004.07219, 2020.
- [6] Gemma Team. Gemma 4: Frontier multimodal intelligence on device. Google DeepMind, https://deepmind.google/models/gemma/gemma-4/, 2026. Released April 2026; open weights under Apache 2.0.
- [7] Stephen Jay Gould and Elisabeth S. Vrba. Exaptation—a missing term in the science of form. Paleobiology, 8(1): 4–15, 1982.
- [8] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural Turing machines. arXiv preprint arXiv:1410.5401, 2014.
- [9] Minyoung Huh, Brian Cheung, Tongzhou Wang, and Phillip Isola. The platonic representation hypothesis. In International Conference on Machine Learning (ICML), 2024. arXiv:2405.07987.
- [10] Herbert Jaeger. The echo state approach to analysing and training recurrent neural networks. GMD Report 148, German National Research Center for Information Technology, 2001.
- [11] Ilya Kostrikov, Ashvin Nair, and Sergey Levine. Offline reinforcement learning with implicit Q-learning. In International Conference on Learning Representations, 2022.
- [12] Aviral Kumar, Aurick Zhou, George Tucker, and Sergey Levine. Conservative Q-learning for offline reinforcement learning. In Advances in Neural Information Processing Systems, 2020.
- [13] Henry W. Lin, Max Tegmark, and David Rolnick. Why does deep and cheap learning work so well? Journal of Statistical Physics, 168: 1223–1247, 2017. arXiv:1608.08225 (2016).
- [14] Kevin Lu, Aditya Grover, Pieter Abbeel, and Igor Mordatch. Pretrained transformers as universal computation engines. In AAAI Conference on Artificial Intelligence, 2022.
- [15] Wolfgang Maass, Thomas Natschläger, and Henry Markram. Real-time computing without stable states: A new framework for neural computation based on perturbations. Neural Computation, 14(11): 2531–2560, 2002.
- [16] Pankaj Mehta and David J. Schwab. An exact mapping between the variational renormalization group and deep learning. arXiv preprint arXiv:1410.3831, 2014.
- [17] Aakanksha Naik and Vishwa Gupta. Adapting pretrained transformers for tasks outside their training distribution. arXiv preprint arXiv:2108.05247, 2021.
- [18] Catherine Olsson, Nelson Elhage, Neel Nanda, Nicholas Joseph, Nova DasSarma, Tom Henighan, Ben Mann, Amanda Askell, Yuntao Bai, Anna Chen, et al. In-context learning and induction heads. Transformer Circuits Thread, 2022.
- [19] Seohong Park, Kevin Frans, Benjamin Eysenbach, and Sergey Levine. OGBench: Benchmarking offline goal-conditioned RL. In International Conference on Learning Representations (ICLR), 2025. arXiv:2410.20092.
- [20] David E. Rumelhart, James L. McClelland, and the PDP Research Group. Parallel Distributed Processing: Explorations in the Microstructure of Cognition. MIT Press, 1986.