
arXiv: 2604.23862 · v3 · pith:HSPOK5VNnew · submitted 2026-04-26 · 💻 cs.LG · cs.AI · cs.CL

Graph Memory Transformer (GMT)

Pith reviewed 2026-05-08 06:24 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords Graph Memory Transformer · transformer · feed-forward network · memory graph · centroids · language modeling · interpretability

The pith

A learned graph of memory centroids can replace the feed-forward sublayer in a decoder-only transformer.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper tests whether the feed-forward network inside each transformer block can be swapped for an explicit memory graph without breaking the autoregressive setup. Instead of undergoing a dense transformation, token representations are moved across a bank of learned centroids linked by a learned directed transition matrix. The resulting model is smaller than the dense baseline it is compared against (82.2M versus 103.0M parameters), and its internal routing is visible as source selection, target choice, and displacement vectors. A reader might care because this suggests a route to language models whose knowledge updates and internal states are more directly editable and observable. The 82.2M-parameter model trains without collapse and stays close to the baseline on zero-shot tasks, even though its validation perplexity trails that of the 103.0M dense model (36.58 versus 26.85).

Core claim

The Graph Memory Transformer (GMT) keeps causal self-attention unchanged but replaces every FFN with a memory cell containing 128 centroids and a 128 by 128 learned transition matrix. Token representations select a source centroid through gravitational routing, choose a target based on the current token, and read out a gated displacement that moves the representation from source toward target. Each of the 16 blocks performs this navigation rather than a standard linear transformation, producing an 82.2 million parameter decoder-only language model whose memory operations remain directly inspectable during the forward pass.
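The three steps named above (source routing, target selection, gated displacement readout) can be sketched in a few lines of plain Python. This page does not give the cell's equations, so everything concrete here is an assumption made for illustration: the inverse-square form of the "gravitational" weighting, the way edge weights and token affinity are summed into target logits, the sigmoid gate, and the names `memory_cell`, `edges`, and `gate_w`. Toy sizes stand in for the 128 centroids per block.

```python
import math
import random

def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def mix(weights, vectors):
    """Weighted sum of vectors: a soft lookup into the centroid bank."""
    dim = len(vectors[0])
    return [sum(w * v[i] for w, v in zip(weights, vectors)) for i in range(dim)]

def memory_cell(x, centroids, edges, gate_w, eps=1e-6):
    """Hypothetical graph-memory cell; returns (displacement, source, target)."""
    k = len(centroids)
    # 1. Source routing. ASSUMPTION: inverse-square attraction to each centroid.
    pull = [1.0 / (sum((xi - ci) ** 2 for xi, ci in zip(x, c)) + eps)
            for c in centroids]
    total = sum(pull)
    source = [p / total for p in pull]
    # 2. Target selection. ASSUMPTION: expected outgoing edge weight from the
    #    soft source, plus a token-centroid affinity term.
    logits = [sum(source[i] * edges[i][j] for i in range(k)) + dot(x, centroids[j])
              for j in range(k)]
    target = softmax(logits)
    # 3. Gated displacement readout: movement from source state toward target.
    src_vec = mix(source, centroids)
    tgt_vec = mix(target, centroids)
    gate = 1.0 / (1.0 + math.exp(-dot(gate_w, x)))  # ASSUMPTION: scalar sigmoid gate
    delta = [gate * (t - s) for t, s in zip(tgt_vec, src_vec)]
    return delta, source, target

# Toy usage: 8 centroids of width 4 instead of 128 per block.
rng = random.Random(0)
K, D = 8, 4
centroids = [[rng.gauss(0, 1) for _ in range(D)] for _ in range(K)]
edges = [[rng.gauss(0, 1) for _ in range(K)] for _ in range(K)]
gate_w = [rng.gauss(0, 1) for _ in range(D)]
x = [rng.gauss(0, 1) for _ in range(D)]
delta, source, target = memory_cell(x, centroids, edges, gate_w)
```

The returned `source` and `target` distributions are the kind of quantity the paper describes as inspectable: they can be logged per token without any extra probing machinery. In a full block, `delta` would presumably be added to the residual stream where the FFN output normally goes.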

What carries the argument

The memory cell that performs gravitational source routing, token-conditioned target selection, and gated displacement readout to compute movement between centroids instead of a dense feed-forward transformation.

Load-bearing premise

Gravitational source routing combined with token-conditioned target selection and gated displacement readout can match the computational role of a dense feed-forward network using only the memory cell's own parameters.

What would settle it

Training the GMT model on the same data and observing that it diverges or produces incoherent text on basic continuation tasks would show that the memory graph cannot substitute for the FFN.

Figures

Figures reproduced from arXiv: 2604.23862 by Evelina Lamma, Niccolò Ferrari, Nicola Zanarini.

Figure 1: Slot-routing flow at Block 00.
Figure 2: Slot-routing flow at Block 06.
Figure 3: Slot-routing flow at Block 11.
Figure 4: Topic-separated Block 11 routing flows for the narrative, political, …
Figure 5: Slot-routing flow at Block 15. The sequence from Blocks 00, 06, …
Figure 6: Active edge structure at Block 00. Darker cells indicate stronger …
Figure 7: Active edge structure at Block 06.
Figure 8: Active edge structure at Block 11.
Figure 9: Active edge structure at Block 15. These are the same representative …
Figure 10: Slot-routing flow in block 13 for a political-text probe, illustrating …
Original abstract

We investigate whether the Feed-Forward Network (FFN) sublayer in a decoder-only transformer can be replaced by an explicit learned memory graph while preserving the surrounding autoregressive architecture. The proposed Graph Memory Transformer (GMT) keeps causal self-attention intact, but replaces the usual per-token FFN transformation with a memory cell that routes token representations over a learned bank of centroids connected by a learned directed transition matrix. In the base GMT v7 instantiation studied here, each of 16 transformer blocks contains 128 centroids, a 128 * 128 edge matrix, gravitational source routing, token-conditioned target selection, and a gated displacement readout. The cell therefore returns movement from an estimated source memory state toward a target memory state, rather than a retrieved value. The resulting model is a fully decoder-only language model with 82.2M trainable parameters and no dense FFN sublayers, compared with a 103.0M-parameter dense GPT-style baseline used in the evaluation. The base v7 model trains stably and exposes centroid usage, transition structure, and source-to-target movement as directly inspectable quantities of the forward computation. It remains behind the larger dense baseline in validation loss and perplexity (3.5995/36.58 vs. 3.2903/26.85), while showing close zero-shot benchmark behavior under the evaluated setting. These results are not intended as a state-of-the-art claim; they support the viability and structural interpretability of replacing dense within-token transformation with graph-mediated memory navigation. Broader scaling, optimized kernels, and more extensive benchmark evaluation are left for subsequent work.
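The loss and perplexity pairs quoted in the abstract can be cross-checked with a few lines of arithmetic. The sketch below assumes the reported validation loss is a mean token-level cross-entropy in nats, so that perplexity is `exp(loss)`; that convention is standard but is not stated on this page. The variable names are illustrative.

```python
import math

# Figures quoted in the abstract.
gmt_loss, dense_loss = 3.5995, 3.2903
gmt_params_m, dense_params_m = 82.2, 103.0

# ASSUMPTION: loss is mean cross-entropy in nats, so perplexity = exp(loss).
gmt_ppl = math.exp(gmt_loss)
dense_ppl = math.exp(dense_loss)

loss_gap = gmt_loss - dense_loss          # gap in nats
ppl_ratio = math.exp(loss_gap)            # multiplicative perplexity gap
param_ratio = gmt_params_m / dense_params_m

print(f"GMT perplexity:    {gmt_ppl:.2f}")
print(f"Dense perplexity:  {dense_ppl:.2f}")
print(f"Loss gap (nats):   {loss_gap:.4f}")
print(f"Perplexity ratio:  {ppl_ratio:.3f}")
print(f"Parameter ratio:   {param_ratio:.3f}")
```

Under that assumption the computed perplexities round to the reported 36.58 and 26.85, which indicates the two columns are internally consistent. The loss gap comes to about 0.31 nats, with GMT using roughly 80% of the baseline's parameters.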

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes replacing the FFN sublayers in decoder-only transformers with a Graph Memory Transformer (GMT) cell that routes token representations over a learned bank of 128 centroids per block connected by a 128x128 directed transition matrix, using gravitational source routing, token-conditioned target selection, and gated displacement readout to return source-to-target movements rather than dense transformations. The base GMT v7 model (82.2M parameters, 16 blocks) trains stably, exposes centroid usage, transitions, and movements as inspectable forward-pass quantities, achieves validation loss/perplexity of 3.5995/36.58 (vs. 3.2903/26.85 for a 103M dense GPT-style baseline), and shows comparable zero-shot benchmark behavior, supporting the viability of graph-mediated memory navigation as an FFN substitute without claiming SOTA results.

Significance. If the substitution holds under further scrutiny, the approach could improve interpretability by making memory operations explicit and directly analyzable, while using fewer parameters than the dense baseline. Stable training and the inspectable quantities are concrete strengths that enable new analyses of internal dynamics. The performance gap and lack of scaling results limit immediate impact, but the work provides a foundation for memory-graph alternatives to opaque FFNs.

major comments (2)
  1. [Experimental evaluation / Results] The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.
  2. [Results] Table or results section reporting validation metrics: the GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than capacity or optimization differences.
minor comments (1)
  1. [Abstract] The abstract states 'close zero-shot benchmark behavior' without naming the specific benchmarks or reporting exact scores, which would strengthen the claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive note on the interpretability potential of GMT. We respond point-by-point to the major comments, agreeing where the experimental design can be strengthened and outlining specific revisions.

Point-by-point responses
  1. Referee: The central viability claim—that gravitational source routing plus token-conditioned target selection plus gated displacement readout functionally substitutes for the dense FFN without extra capacity or changes outside the memory cell—lacks supporting ablations. No experiments disable or randomize the routing/readout components while holding total parameter count fixed at 82.2M (or compare against a generic low-rank memory bank), so it remains possible that any centroid bank would produce similar results and that the graph structure is not load-bearing.

    Authors: We agree that component-level ablations would more convincingly demonstrate that the graph-mediated mechanisms are load-bearing rather than incidental. The current manuscript supports viability through stable end-to-end training and by exposing and qualitatively analyzing centroid usage, transition matrices, and source-to-target displacements as direct outputs of the forward pass. To address the concern directly, the revised manuscript will add an ablation subsection that (i) replaces gravitational source routing with uniform selection, (ii) randomizes the 128x128 transition matrix while preserving parameter count, and (iii) compares against a capacity-matched low-rank memory bank without learned routing. These experiments will be reported alongside the existing results. revision: yes

  2. Referee: Table or results section reporting validation metrics: the GMT trails the dense baseline by ~0.3 nats / ~10 perplexity points, yet no error bars, multiple random seeds, or capacity-matched dense baseline (e.g., 82M-parameter dense model) are provided. This weakens the ability to attribute the gap specifically to the architectural substitution rather than capacity or optimization differences.

    Authors: We acknowledge that single-run results and the absence of a capacity-matched baseline limit attribution of the observed gap. The manuscript already states that results are not intended as a superiority claim and that the 103M dense model serves only as a reference point. In revision we will add an 82M-parameter dense GPT-style baseline trained under identical conditions and report its validation loss/perplexity. We will also state explicitly that all reported numbers are from single training runs (due to compute cost) and will include standard deviations from two additional seeds for the primary GMT and dense models if resources permit; otherwise the limitation will be noted in the text. revision: partial
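Two of the ablations promised in the first response, uniform source selection and a randomized transition matrix with unchanged parameter count, are simple to express. The following is a minimal sketch under stated assumptions: the function names `uniform_source` and `shuffle_edges` are hypothetical, the transition matrix is represented as a list of rows, and "randomize" is read as permuting the learned entries (which preserves both the parameter count and the weight distribution). The authors may intend a different randomization.

```python
import random

def uniform_source(num_centroids):
    """Ablation (i): ignore the token and spread source mass evenly."""
    return [1.0 / num_centroids] * num_centroids

def shuffle_edges(edges, seed=0):
    """Ablation (ii): permute all entries of a square transition matrix.

    ASSUMPTION: 'randomize' means shuffling learned weights in place of
    resampling them, so the set of values and the parameter count are kept
    while the learned graph structure is destroyed.
    """
    rng = random.Random(seed)
    k = len(edges)
    flat = [w for row in edges for w in row]
    rng.shuffle(flat)
    return [flat[i * k:(i + 1) * k] for i in range(k)]

# Toy usage with a 4 x 4 matrix standing in for the 128 x 128 one.
toy_edges = [[float(4 * i + j) for j in range(4)] for i in range(4)]
ablated_edges = shuffle_edges(toy_edges, seed=1)
flat_source = uniform_source(4)
```

If validation loss barely moves when `ablated_edges` replaces the learned matrix, that would support the referee's worry that the graph structure is not load-bearing; a clear degradation would point the other way.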

Circularity Check

0 steps flagged

No circularity: empirical architecture proposal with measured results

Full rationale

The paper is an empirical investigation of an architectural substitution (FFN replaced by graph memory cell with gravitational routing and gated readout). No derivation chain, equations, or first-principles predictions are presented; all reported quantities (validation loss 3.5995, perplexity 36.58, parameter counts) are direct training measurements compared against a baseline. No self-citations, ansatzes, or fitted inputs are invoked as load-bearing for any claimed result. The work is self-contained as an experimental demonstration of viability.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The central claim rests on the assumption that a small learned graph can substitute for the FFN without further changes to attention or normalization. The design introduces many learned parameters (centroid embeddings, transition matrix, routing weights) that are fitted during training rather than derived. No new physical or mathematical axioms are invoked beyond standard transformer training assumptions.

free parameters (3)
  • number of centroids per block
    Fixed at 128; chosen by hand to balance capacity and inspectability.
  • transition matrix size
    128x128 learned matrix per block; size is a design choice.
  • gravitational source routing parameters
    Learned weights that map token states to source centroids.
axioms (2)
  • [domain assumption] Causal self-attention remains unchanged and sufficient when paired with the new memory cell.
    Stated in the abstract as keeping causal self-attention intact.
  • [domain assumption] Standard autoregressive language modeling objective is appropriate for evaluating the replacement.
    Implicit in the comparison to GPT-style baseline.
invented entities (2)
  • centroid bank as memory states (no independent evidence)
    purpose: Discrete memory locations that tokens route between instead of dense FFN transformation.
    New postulated memory structure; no independent evidence outside the model itself.
  • learned directed transition matrix (no independent evidence)
    purpose: Encodes movement rules between memory centroids.
    Invented component of the graph memory cell.
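To put the ledger's sizes in perspective, the parameter cost of the graph components can be tallied from the figures given here: 16 blocks, 128 centroids, and a 128x128 edge matrix per block. The centroid bank's size also depends on the model width, which this page does not report, so the sketch below takes `d_model` as an optional, hypothetical argument. The function name and the example width of 768 are illustrative assumptions, not values from the paper.

```python
def graph_memory_params(num_blocks=16, num_centroids=128, d_model=None):
    """Count edge-matrix parameters, plus centroid parameters if width is known."""
    edge_per_block = num_centroids * num_centroids
    counts = {
        "edge_per_block": edge_per_block,
        "edge_total": num_blocks * edge_per_block,
    }
    if d_model is not None:
        # ASSUMPTION: each centroid is a single d_model-wide vector.
        centroid_per_block = num_centroids * d_model
        counts["centroid_per_block"] = centroid_per_block
        counts["centroid_total"] = num_blocks * centroid_per_block
    return counts

counts = graph_memory_params()
edge_share = counts["edge_total"] / 82.2e6   # share of the 82.2M total

# Hypothetical width, used only to show how the centroid term scales.
example = graph_memory_params(d_model=768)

print(counts)
print(f"edge matrices as share of 82.2M parameters: {edge_share:.2%}")
print(example)
```

The edge matrices total 262,144 parameters across all blocks, well under one percent of the 82.2M reported. Most of the model's capacity therefore sits elsewhere; how it divides among embeddings, attention, centroids, and routing weights is not broken down on this page.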

pith-pipeline@v0.9.0 · 5587 in / 1831 out tokens · 43532 ms · 2026-05-08T06:24:14.512096+00:00 · methodology
