Graph Memory Transformer (GMT)
Pith reviewed 2026-05-08 06:24 UTC · model grok-4.3
The pith
A learned graph of memory centroids can replace the feed-forward sublayer in a decoder-only transformer.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The Graph Memory Transformer (GMT) keeps causal self-attention unchanged but replaces every FFN with a memory cell containing 128 centroids and a 128 by 128 learned transition matrix. Token representations select a source centroid through gravitational routing, choose a target based on the current token, and read out a gated displacement that moves the representation from source toward target. Each of the 16 blocks performs this navigation rather than a standard linear transformation, producing an 82.2 million parameter decoder-only language model whose memory operations remain directly inspectable during the forward pass.
What carries the argument
The memory cell that performs gravitational source routing, token-conditioned target selection, and gated displacement readout to compute movement between centroids instead of a dense feed-forward transformation.
Load-bearing premise
Gravitational source routing combined with token-conditioned target selection and gated displacement readout can match the computational role of a dense feed-forward network using only the memory cell's own parameters.
What would settle it
Training the GMT model on the same data and observing that it diverges or produces incoherent text on basic continuation tasks would show that the memory graph cannot substitute for the FFN.
Figures
read the original abstract
We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes replacing the FFN sublayers in decoder-only transformers with a Graph Memory Transformer (GMT) cell that routes token representations over a learned bank of 128 centroids per block connected by a 128x128 directed transition matrix, using gravitational source routing, token-conditioned target selection, and gated displacement readout to return source-to-target movements rather than dense transformations. The base GMT v7 model (82.2M parameters, 16 blocks) trains stably, exposes centroid usage, transitions, and movements as inspectable forward-pass quantities, achieves validation loss/perplexity of 3.5995/36.58 (vs. 3.2903/26.85 for a 103M dense GPT-style baseline), and shows comparable zero-shot benchmark behavior, supporting the viability of graph-mediated memory navigation as an FFN substitute without claiming SOTA results.
Significance. If the substitution holds under further scrutiny, the approach could improve interpretability by making memory operations explicit and directly analyzable, while using fewer parameters than the dense baseline. Stable training and the inspectable quantities are concrete strengths that enable new analyses of internal dynamics. The performance gap and lack of scaling results limit immediate impact, but the work provides a foundation for memory-graph alternatives to opaque FFNs.
major comments (2)
- [Experimental evaluation / Results] The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.
- [Results] Table or results section reporting validation metrics: the GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than capacity or optimization differences.
minor comments (1)
- [Abstract] The abstract states 'close zero-shot benchmark behavior' without naming the specific benchmarks or reporting exact scores, which would strengthen the claim.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive note on the interpretability potential of GMT. We respond point-by-point to the major comments, agreeing where the experimental design can be strengthened and outlining specific revisions.
read point-by-point responses
-
Referee: The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.
Authors: We agree that component-level ablations would more convincingly demonstrate that the graph-mediated mechanisms are load-bearing rather than incidental. The current manuscript supports viability through stable end-to-end training and by exposing and qualitatively analyzing centroid usage, transition matrices, and source-to-target displacements as direct outputs of the forward pass. To address the concern directly, the revised manuscript will add an ablation subsection that (i) replaces gravitational source routing with uniform selection, (ii) randomizes the 128x128 transition matrix while preserving parameter count, and (iii) compares against a capacity-matched low-rank memory bank without learned routing. These experiments will be reported alongside the existing results. revision: yes
-
Referee: Table or results section reporting validation metrics: the GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than capacity or optimization differences.
Authors: We acknowledge that single-run results and the absence of a capacity-matched baseline limit attribution of the observed gap. The manuscript already states that results are not intended as a superiority claim and that the 103M dense model serves only as a reference point. In revision we will add an 82M-parameter dense GPT-style baseline trained under identical conditions and report its validation loss/perplexity. We will also state explicitly that all reported numbers are from single training runs (due to compute cost) and will include standard deviations from two additional seeds for the primary GMT and dense models if resources permit; otherwise the limitation will be noted in the text. revision: partial
Circularity Check
No circularity: empirical architecture proposal with measured results
full rationale
The paper is an empirical investigation of an architectural substitution (FFN replaced by graph memory cell with gravitational routing and gated readout). No derivation chain, equations, or first-principles predictions are presented; all reported quantities (validation loss 3.5995, perplexity 36.58, parameter counts) are direct training measurements compared against a baseline. No self-citations, ansatzes, or fitted inputs are invoked as load-bearing for any claimed result. The work is self-contained as an experimental demonstration of viability.
Axiom & Free-Parameter Ledger
free parameters (3)
- number of centroids per block
- transition matrix size
- gravitational source routing parameters
axioms (2)
- domain assumption Causal self-attention remains unchanged and sufficient when paired with the new memory cell.
- domain assumption Standard autoregressive language modeling objective is appropriate for evaluating the replacement.
invented entities (2)
-
centroid bank as memory states
no independent evidence
-
learned directed transition matrix
no independent evidence
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.