Graph Memory Transformer (GMT)
Pith reviewed 2026-05-08 06:24 UTC · model grok-4.3
The pith
A learned graph of memory centroids can replace the feed-forward sublayer in a decoder-only transformer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Graph Memory Transformer (GMT) keeps causal self-attention unchanged but replaces every FFN with a memory cell containing 128 centroids and a 128 × 128 learned transition matrix. Token representations select a source centroid through gravitational routing, choose a target conditioned on the current token, and read out a gated displacement that moves the representation from source toward target. Each of the 16 blocks performs this navigation rather than a standard linear transformation, producing an 82.2M-parameter decoder-only language model whose memory operations remain directly inspectable during the forward pass.
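The mechanism in the core claim can be sketched concretely. The following is a minimal, hedged reconstruction of one memory-cell step for a single token, built only from the description above; the weight names (`W_tgt`, `W_gate`) and the exact way token evidence is combined with the transition row are assumptions, not the paper's implementation.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def memory_cell(h, centroids, edges, W_tgt, W_gate):
    """Hedged sketch of one GMT memory-cell step for a single token.

    h         : (d,)   token representation entering the cell
    centroids : (K, d) learned memory centroids (K = 128 in the paper)
    edges     : (K, K) learned directed transition matrix
    W_tgt, W_gate : (d, d) assumed projections, not from the paper
    """
    # Gravitational source routing: softly attract h to nearby centroids.
    dist2 = ((centroids - h) ** 2).sum(axis=-1)      # (K,) squared distances
    src_w = softmax(-dist2)                          # closer => heavier weight
    src = src_w @ centroids                          # estimated source state

    # Token-conditioned target selection: token evidence plus the
    # transition structure out of the (soft) source centroid.
    tgt_logits = centroids @ (W_tgt @ h) + src_w @ edges
    tgt_w = softmax(tgt_logits)
    tgt = tgt_w @ centroids                          # target state

    # Gated displacement readout: return movement, not a retrieved value.
    gate = 1.0 / (1.0 + np.exp(-(W_gate @ h)))       # elementwise in (0, 1)
    return gate * (tgt - src)
```

In a block, this output would be added residually to `h` in place of the FFN output; the routing weights `src_w`, `tgt_w` and the displacement itself are the directly inspectable quantities the claim refers to.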
What carries the argument
The memory cell that performs gravitational source routing, token-conditioned target selection, and gated displacement readout to compute movement between centroids instead of a dense feed-forward transformation.
Load-bearing premise
Gravitational source routing combined with token-conditioned target selection and gated displacement readout can match the computational role of a dense feed-forward network using only the memory cell's own parameters.
What would settle it
Training the GMT model on the same data and observing that it diverges or produces incoherent text on basic continuation tasks would show that the memory graph cannot substitute for the FFN.
Original abstract
We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 × 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes replacing the FFN sublayers in decoder-only transformers with a Graph Memory Transformer (GMT) cell that routes token representations over a learned bank of 128 centroids per block, connected by a 128 × 128 directed transition matrix. Gravitational source routing, token-conditioned target selection, and a gated displacement readout together return source-to-target movements rather than dense transformations. The base GMT v7 model (82.2M parameters, 16 blocks) trains stably, exposes centroid usage, transitions, and movements as inspectable forward-pass quantities, and achieves validation loss/perplexity of 3.5995/36.58 (vs. 3.2903/26.85 for a 103M dense GPT-style baseline) with comparable zero-shot benchmark behavior. The authors present this as support for the viability of graph-mediated memory navigation as an FFN substitute, without claiming state-of-the-art results.
Significance. If the substitution holds under further scrutiny, the approach could improve interpretability by making memory operations explicit and directly analyzable, while using fewer parameters than the dense baseline. Stable training and the inspectable quantities are concrete strengths that enable new analyses of internal dynamics. The performance gap and lack of scaling results limit immediate impact, but the work provides a foundation for memory-graph alternatives to opaque FFNs.
major comments (2)
- [Experimental evaluation / Results] The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.
- [Results] The GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., an 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than to capacity or optimization differences.
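Before attributing the gap, it is worth confirming the reported numbers are internally consistent: perplexity is the exponential of the per-token validation loss in nats, and the ~0.31-nat gap means the dense baseline assigns each token roughly 1.36× the probability the GMT does. A quick check:

```python
import math

# (validation loss in nats, reported perplexity) pairs from the abstract
reported = [(3.5995, 36.58),   # GMT v7, 82.2M parameters
            (3.2903, 26.85)]   # dense GPT-style baseline, 103.0M parameters

for nll, ppl in reported:
    # Perplexity is exp(per-token negative log-likelihood).
    assert round(math.exp(nll), 2) == ppl

gap_nats = 3.5995 - 3.2903            # ~0.31 nats
per_token_ratio = math.exp(gap_nats)  # ~1.36x probability per token
```

The 82.2M-vs-103.0M parameter mismatch is exactly the capacity confound that a matched dense baseline would resolve.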
minor comments (1)
- [Abstract] The abstract states 'close zero-shot benchmark behavior' without naming the specific benchmarks or reporting exact scores, which would strengthen the claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive note on the interpretability potential of GMT. We respond point-by-point to the major comments, agreeing where the experimental design can be strengthened and outlining specific revisions.
Point-by-point responses
- Referee: The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.
Authors: We agree that component-level ablations would more convincingly demonstrate that the graph-mediated mechanisms are load-bearing rather than incidental. The current manuscript supports viability through stable end-to-end training and by exposing and qualitatively analyzing centroid usage, transition matrices, and source-to-target displacements as direct outputs of the forward pass. To address the concern directly, the revised manuscript will add an ablation subsection that (i) replaces gravitational source routing with uniform selection, (ii) randomizes the 128 × 128 transition matrix while preserving parameter count, and (iii) compares against a capacity-matched low-rank memory bank without learned routing. These experiments will be reported alongside the existing results. revision: yes
- Referee: The GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., an 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than to capacity or optimization differences.
Authors: We acknowledge that single-run results and the absence of a capacity-matched baseline limit attribution of the observed gap. The manuscript already states that results are not intended as a superiority claim and that the 103M dense model serves only as a reference point. In revision we will add an 82M-parameter dense GPT-style baseline trained under identical conditions and report its validation loss/perplexity. We will also state explicitly that all reported numbers are from single training runs (due to compute cost) and will include standard deviations from two additional seeds for the primary GMT and dense models if resources permit; otherwise the limitation will be noted in the text. revision: partial
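The ablations promised in the first response are straightforward to specify. A hedged sketch of variants (i) and (ii), assuming the routing weights and transition-matrix shapes described in the abstract; nothing here is the authors' code:

```python
import numpy as np

def uniform_source_routing(num_centroids: int) -> np.ndarray:
    # (i) Replace gravitational source routing with uniform selection:
    # every centroid gets equal weight, so distance to the token is ignored.
    return np.full(num_centroids, 1.0 / num_centroids)

def randomized_transitions(edges: np.ndarray, seed: int = 0) -> np.ndarray:
    # (ii) Re-sample the directed transition matrix at matched shape and
    # scale. The parameter count stays fixed while the learned graph
    # structure is destroyed; if results survive this, the graph is
    # not load-bearing.
    rng = np.random.default_rng(seed)
    return rng.standard_normal(edges.shape) * edges.std() + edges.mean()
```

Variant (iii), the capacity-matched low-rank memory bank, is a separate model rather than a parameter swap and is not sketched here.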
Circularity Check
No circularity: empirical architecture proposal with measured results
full rationale
The paper is an empirical investigation of an architectural substitution (FFN replaced by graph memory cell with gravitational routing and gated readout). No derivation chain, equations, or first-principles predictions are presented; all reported quantities (validation loss 3.5995, perplexity 36.58, parameter counts) are direct training measurements compared against a baseline. No self-citations, ansatzes, or fitted inputs are invoked as load-bearing for any claimed result. The work is self-contained as an experimental demonstration of viability.
Axiom & Free-Parameter Ledger
free parameters (3)
- number of centroids per block
- transition matrix size
- gravitational source routing parameters
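The free parameters above can be gathered into one hedged configuration sketch; the field names are invented here for illustration, and the per-block size of the transition matrix follows directly from the centroid count:

```python
from dataclasses import dataclass

@dataclass
class MemoryCellConfig:
    """Free parameters of the GMT memory cell (base v7 values).

    Field names are assumptions, not the paper's identifiers.
    """
    num_centroids: int = 128           # centroids per block
    routing_temperature: float = 1.0   # gravitational-routing sharpness (assumed)

    @property
    def transition_params(self) -> int:
        # The directed transition matrix is num_centroids x num_centroids,
        # so it grows quadratically: 128 centroids -> 16,384 edge weights
        # per block.
        return self.num_centroids ** 2
```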
axioms (2)
- domain assumption: Causal self-attention remains unchanged and sufficient when paired with the new memory cell.
- domain assumption: The standard autoregressive language-modeling objective is appropriate for evaluating the replacement.
invented entities (2)
- centroid bank as memory states (no independent evidence)
- learned directed transition matrix (no independent evidence)