pith. machine review for the scientific record.

arxiv: 2605.10643 · v1 · submitted 2026-05-11 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links


A Single-Layer Model Can Do Language Modeling

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:36 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords single-layer language models · recurrent networks · state vector · perplexity · grounded prediction networks · fineweb-edu · transformer alternatives · memory geometry

The pith

A single recurrent layer with one shared state vector can approach the language modeling performance of much deeper stacked models.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper challenges the standard practice of building language models by stacking many layers, each maintaining its own independent state such as a KV cache or matrix. It proposes instead a Grounded Prediction Network that reuses one state vector at every step through a single recurrent block consisting of one feed-forward network and one shared matrix memory. At 130 million parameters this one-layer version reaches a perplexity of 18.06 on FineWeb-Edu, which is within 13 percent of a twelve-layer transformer baseline and within 18 percent of a ten-layer gated delta network. A two-layer version narrows the remaining gap to roughly 6 and 11 percent respectively. Because only one state exists, the geometry of that vector can be inspected directly, revealing a default-token direction, a content horizon of tens of tokens, and spontaneous fast and slow memory pools.
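
As a sanity check on how the "within X percent" figures read, the gaps appear to be relative perplexity differences against each baseline; the short sketch below recomputes them under that assumption. The perplexity values come from the paper, the relative-difference convention is assumed.

```python
# Relative perplexity gaps implied by the reported numbers. Treating
# "within X percent" as (GPN - baseline) / baseline is an assumption;
# only the perplexity values themselves come from the paper.
ppl_gpn_1l = 18.06   # 1-layer GPN+M, 130M parameters
ppl_tfpp   = 16.05   # 12-layer Transformer++
ppl_gdn    = 15.34   # 10-layer Gated DeltaNet

gap_tfpp = (ppl_gpn_1l - ppl_tfpp) / ppl_tfpp   # ~0.125 -> "within 13 percent"
gap_gdn  = (ppl_gpn_1l - ppl_gdn) / ppl_gdn     # ~0.177 -> "within 18 percent"
print(f"{gap_tfpp:.1%} vs Transformer++, {gap_gdn:.1%} vs GDN")
```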

Core claim

Grounded Prediction Networks achieve competitive language-modeling perplexity by processing language through a single recurrent state vector that is updated at each step by one shared feed-forward network and one matrix memory, without requiring the per-layer independent states used in transformers, Mamba, or similar deep architectures.

What carries the argument

Grounded Prediction Networks (GPN): a single recurrent state vector revisited through one block containing one feed-forward network and one shared matrix memory.
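
The reviewed text does not reproduce GPN's update equations, so the following is only a minimal sketch of the single-state pattern it describes: one state vector revisited at every step by one feed-forward network and one shared matrix memory. The embedding, read, write, and decay rules below are illustrative assumptions, not the paper's method.

```python
import numpy as np

# Minimal single-state recurrence in the spirit of GPN+M: every step touches
# the same state vector s and the same matrix memory M. All specifics
# (dimensions, read/write rule, decay) are assumptions for illustration.
d, vocab = 512, 32000                        # hypothetical sizes
rng = np.random.default_rng(0)
W1 = rng.normal(scale=0.02, size=(4 * d, d))
W2 = rng.normal(scale=0.02, size=(d, 4 * d))
E = rng.normal(scale=0.02, size=(vocab, d))  # token embedding table

def ffn(x):
    # the single shared feed-forward network
    return W2 @ np.maximum(W1 @ x, 0.0)

def gpn_step(s, M, token_id, alpha=0.9, beta=0.1):
    x = E[token_id]                                # embed the current token
    read = M @ s / (np.linalg.norm(s) + 1e-6)      # read from the shared matrix memory
    s_new = s + ffn(s + x + read)                  # revisit the one state vector
    M_new = alpha * M + beta * np.outer(s_new, x)  # decayed outer-product write (assumed rule)
    return s_new, M_new

s, M = np.zeros(d), np.zeros((d, d))
for tok in [17, 342, 9001]:                        # toy token ids
    s, M = gpn_step(s, M, tok)
```

The point of the sketch is the bookkeeping rather than the particular formulas: unlike a stacked model, every step reuses the same s and the same M, which is what makes the geometry of s directly inspectable.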

If this is right

  • Model depth can be reduced from ten or twelve layers to one or two while retaining most perplexity performance.
  • Only a single state vector needs to be maintained and inspected during generation or analysis.
  • Memory behavior inside the state spontaneously partitions into fast and slow retention components.
  • Direct geometric measurements of the working context become possible because there is only one vector rather than many.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Inference speed and memory use could improve because only one state vector (plus the shared matrix memory) is cached instead of one state per layer; a rough sizing sketch follows this list.
  • The single-state design may make it easier to diagnose failure modes such as context loss on specific dependency lengths.
  • The observed spontaneous split into fast and slow memory pools suggests a route for adding explicit multi-timescale control without extra layers.
  • If the approach scales, it could reduce the hardware requirements for training large language models while preserving most capability.
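
To put rough numbers on the caching point in the first bullet (none of these dimensions appear in the paper), a per-layer KV cache grows linearly with sequence length, while one state vector plus one shared matrix memory stays constant:

```python
# Rough inference-state sizing under assumed dimensions; the paper reports
# none of these figures, so this is only an order-of-magnitude sketch.
layers, heads, head_dim, seq_len = 12, 12, 64, 2048   # typical 130M transformer (assumed)
d = 768                                               # GPN state width (assumed)

kv_cache_floats = layers * 2 * heads * head_dim * seq_len  # per-layer K and V for every token
gpn_state_floats = d + d * d                               # one state vector + one d x d memory

print(f"KV cache: {kv_cache_floats:,} floats, GPN state: {gpn_state_floats:,} floats")
# At these settings: 37,748,736 vs 590,592, and only the KV cache keeps
# growing as the sequence gets longer.
```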

Load-bearing premise

A single recurrent state vector can capture and retain enough long-range context and dependencies for competitive language modeling without independent states per layer.

What would settle it

Measure whether the one-layer GPN's perplexity rises sharply above the deeper baselines when evaluated on sequences that require reliable retention of dependencies longer than a few dozen tokens.
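
One way to approximate that test without the paper's evaluation code is a retention probe: plant a token, pad to a controlled distance, and check whether the model still predicts it when cued. The `log_prob_next` scoring interface below is hypothetical; a sharp drop past a few dozen filler tokens would corroborate the concern, while flat curves would support the sufficiency claim. The decisive evidence would still be perplexity on naturally long-dependency text, which this proxy only approximates.

```python
import numpy as np

# Long-range retention probe. `log_prob_next(prefix, token)` is a hypothetical
# interface to whichever model is under test; nothing here comes from the paper.
def retention_curve(log_prob_next, needle, cue, filler, distances):
    curve = {}
    for dist in distances:
        prefix = [cue, needle] + [filler] * dist + [cue]  # cue, needle, dist fillers, cue
        curve[dist] = log_prob_next(prefix, needle)       # stays high if the needle is retained
    return curve

# Toy usage with a placeholder scorer (swap in the real model's scorer):
dummy = lambda prefix, token: -np.log(len(prefix))
print(retention_curve(dummy, needle=7, cue=3, filler=0, distances=[8, 32, 128, 512]))
```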

Figures

Figures reproduced from arXiv: 2605.10643 by Zanmin Wang.

Figure 1. The grounded-prediction loop, unrolled for two steps. The predicted state … [PITH_FULL_IMAGE:figures/full_fig_p003_1.png]

Figure 2. One step of GPN (no memory), horizontal. Blocks show the operation formulas directly; … [PITH_FULL_IMAGE:figures/full_fig_p004_2.png]

Figure 3. One step of GPN+M, horizontal. Each state bundle carries both the working context … [PITH_FULL_IMAGE:figures/full_fig_p005_3.png]

Figure 4. State geometry (1L GPN+M, 130M). (a) Sorted per-dimension std: heavy tail, no dead dimensions (min 1.05, max 11.3). (b) Off-diagonal correlations (300-dim sample): |mean| = 0.03. (c) Cumulative variance: 1131 PCs for 90%, 2099 for 99%; stable rank ≈ 40. High-rank, near-decorrelated, no dead dimensions, with no explicit regularizer. A dominant mean direction: the state's mean s̄ = E_t[s_t^p] has magnitude ‖s̄‖ …

Figure 5. Temporal horizons. (a) State cosine trajectory E_t[cos(s_t^p, s_{t+k}^p)] vs. lag k (log-x). The raw state (blue) plateaus near 0.65 up to k = 1500 because the shared mean direction dominates both vectors; after subtracting each sequence's mean (red), the content-bearing component drops from 0.31 at k = 1 to ≈ 0 by k ≈ 256. (b) Per-head retention ∏_{τ≤k} α_τ^(h) of the matrix memory M, one curve per head; dots …
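
For readers who want the Figure 4 and Figure 5 style measurements on their own runs, the sketch below computes the sorted per-dimension std and the cosine-versus-lag curve with and without mean subtraction. The `states` array (shape T x d, one recorded state per step) is assumed, and the paper's exact estimators may differ.

```python
import numpy as np

# State-geometry summary from a recorded trajectory `states` of shape (T, d).
def geometry_summary(states, lags=(1, 16, 64, 256, 1024)):
    per_dim_std = np.sort(states.std(axis=0))[::-1]   # Fig. 4a: sorted per-dimension std
    mean_dir = states.mean(axis=0)                    # the dominant mean direction
    centered = states - mean_dir                      # remove it to expose content

    def cos_at_lag(x, k):
        a, b = x[:-k], x[k:]
        num = (a * b).sum(axis=1)
        den = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1) + 1e-9
        return float((num / den).mean())

    raw = {k: cos_at_lag(states, k) for k in lags if k < len(states)}       # Fig. 5a, raw state
    content = {k: cos_at_lag(centered, k) for k in lags if k < len(states)} # Fig. 5a, mean-subtracted
    return per_dim_std, raw, content

# Toy usage with random states standing in for a real recorded trajectory:
stds, raw_cos, content_cos = geometry_summary(np.random.default_rng(1).normal(size=(2048, 64)))
```
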
read the original abstract

Modern language models scale depth by stacking layers, each holding its own state - a per-layer KV cache in transformers, a per-layer matrix in Mamba, Gated DeltaNet (GDN), RWKV, and xLSTM. Biological systems lean heavily on recurrence rather than on stacking. We ask how far that shape can go on language modeling. We propose Grounded Prediction Networks (GPN): one state vector revisited at every step through a single recurrent block - one FFN, one shared matrix memory. At 130M parameters, a 1-layer GPN+M reaches FineWeb-Edu perplexity 18.06, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant closes the gap to 6%/11%. We do not match the deep baselines. Because the working context is a single vector, we can directly inspect its geometry: a persistent default-token direction, a content-bearing horizon of tens of tokens, and memory heads that split spontaneously into fast and slow retention pools.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance; this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Grounded Prediction Networks (GPN), a single-layer recurrent model consisting of one state vector updated through a single recurrent block (one FFN and one shared matrix memory). It reports that a 1-layer GPN+M at 130M parameters achieves 18.06 perplexity on FineWeb-Edu, within 13% of a 12-layer Transformer++ (16.05) and 18% of a 10-layer GDN (15.34); a 2-layer variant narrows the gap to 6%/11%. The work emphasizes direct geometric inspection of the single state vector, identifying a persistent default-token direction, a content-bearing horizon of tens of tokens, and spontaneous fast/slow retention pools.

Significance. If the performance claims are reproducible, the result would be significant for showing that a single recurrent state vector can approach the perplexity of much deeper stacked architectures on language modeling, challenging the default assumption that depth is required for competitive performance. The direct inspectability of the state geometry is a clear strength, as is the biological motivation for recurrence over stacking. The work does not claim to surpass the deep baselines.

major comments (2)
  1. [Abstract] The central empirical claim (18.06 perplexity for 1-layer GPN+M) is presented without any training details, hyperparameter settings, baseline implementation notes, data preprocessing, or statistical significance tests, making it impossible to verify the reported gaps to Transformer++ and GDN.
  2. [Abstract] The stated 'content-bearing horizon of tens of tokens' for the single state vector conflicts with the long-range dependencies (hundreds to thousands of tokens) typical in FineWeb-Edu; without context-length ablations, retention curves, or quantitative memory analysis, the sufficiency of the single-vector design for the reported perplexity remains untested.
minor comments (2)
  1. The notation 'GPN+M' is used in the abstract without definition or expansion in the provided text.
  2. No variance, multiple runs, or error bars are reported for any perplexity numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and describe the revisions we will make to improve clarity and verifiability while preserving the manuscript's core claims.

read point-by-point responses
  1. Referee: [Abstract] The central empirical claim (18.06 perplexity for 1-layer GPN+M) is presented without any training details, hyperparameter settings, baseline implementation notes, data preprocessing, or statistical significance tests, making it impossible to verify the reported gaps to Transformer++ and GDN.

    Authors: We agree that the abstract's brevity limits immediate verifiability. The full manuscript details the training setup, hyperparameters, data preprocessing on FineWeb-Edu, baseline implementations, and evaluation protocol in the Methods and Experiments sections. We will revise the abstract to include a concise statement of key training details (dataset, optimizer, and parameter scale) and explicitly reference those sections for full reproducibility. We will also add multiple-run statistics or error estimates to the results tables in the revised version. revision: partial

  2. Referee: [Abstract] The stated 'content-bearing horizon of tens of tokens' for the single state vector conflicts with the long-range dependencies (hundreds to thousands of tokens) typical in FineWeb-Edu; without context-length ablations, retention curves, or quantitative memory analysis, the sufficiency of the single-vector design for the reported perplexity remains untested.

    Authors: The phrase 'content-bearing horizon of tens of tokens' specifically describes the directly observable geometric content of the single state vector at any timestep, as analyzed via the shared memory matrix in our geometric inspection. Longer-range dependencies are handled through the recurrent dynamics that update this vector over many steps, allowing information to propagate via the persistent state. The reported perplexity on FineWeb-Edu already provides empirical evidence of sufficiency. We acknowledge that additional quantitative support would strengthen the presentation and will add context-length ablations, retention curves, and expanded memory analysis to the revised manuscript. revision: yes

Circularity Check

0 steps flagged

No circularity: purely empirical architecture comparison with no derivations or self-referential fitting

full rationale

The manuscript introduces GPN as a single-state recurrent architecture and reports direct training results on FineWeb-Edu perplexity against deeper baselines. No equations, parameter-fitting procedures, uniqueness theorems, or ansatzes are present that could reduce a claimed prediction to its own inputs by construction. The central performance numbers (e.g., 18.06 PPL) are obtained from standard training and evaluation runs rather than any self-definitional or fitted-input mechanism. Self-citations are absent from the provided text, and the geometry inspection of the state vector is a post-hoc observation, not a load-bearing premise. The derivation chain is therefore empty and the result is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The claim rests on the domain assumption that one recurrent state vector suffices for language modeling context; no free parameters or invented entities beyond the new GPN name are stated in the abstract.

axioms (1)
  • domain assumption A single recurrent state vector can capture necessary contextual information for language modeling
    This premise is required for the claim that one layer can approach deep-model performance.
invented entities (1)
  • Grounded Prediction Networks (GPN) no independent evidence
    purpose: Single-layer recurrent architecture with one shared state vector
    New model family introduced to test the single-state hypothesis.

pith-pipeline@v0.9.0 · 5485 in / 1416 out tokens · 55505 ms · 2026-05-12T04:36:52.527567+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

12 extracted references · 12 canonical work pages · 7 internal anchors

  1. [1]

    xLSTM: Extended Long Short-Term Memory

    Maximilian Beck, Korbinian Pöppel, Markus Spanring, Andreas Auer, Oleksandra Prudnikova, Michael Kopp, Günter Klambauer, Johannes Brandstetter, and Sepp Hochreiter. xLSTM: Extended long short-term memory. arXiv preprint arXiv:2405.04517.

  2. [2]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2025a.
    Ali Behrouz et al. Nested learning: The illusion of deep learning architectures. arXiv preprint arXiv:2512.24695, 2025b.
    Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured...

  3. [3]

    Griffin: Mixing Gated Linear Recurrences with Local Attention for Efficient Language Models

    Soham De, Samuel L. Smith, Anushan Fernando, Aleksandar Botev, George Cristian-Muraru, Albert Gu, Ruba Haroun, Leonard Berrada, Yutian Chen, Srivatsan Srinivasan, et al. Griffin: Mixing gated linear recurrences with local attention for efficient language models. arXiv preprint arXiv:2402.19427.

  4. [4]

    Hymba: A Hybrid-head Architecture for Small Language Models

    Xin Dong, Yonggan Fu, Shizhe Diao, Wonmin Byeon, Zijia Chen, Ameya S. Mahabaleshwarkar, Shih-Yang Liu, Matthijs Van Keirsbilck, Min-Hung Chen, Yoshi Suhara, et al. Hymba: A hybrid-head architecture for small language models. arXiv preprint arXiv:2411.13676.

  5. [5]

    Mamba: Linear-Time Sequence Modeling with Selective State Spaces

    Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752.

  6. [6]

    Efficiently Modeling Long Sequences with Structured State Spaces

    Albert Gu, Karan Goel, and Christopher Ré. Efficiently modeling long sequences with structured state spaces. arXiv preprint arXiv:2111.00396.

  7. [7]

    Leave No Context Behind: Efficient Infinite Context Transformers with Infini-Attention, 2024

    Tsendsuren Munkhdalai, Manaal Faruqui, and Siddharth Gopal. Leave no context behind: Efficient infinite context transformers with infini-attention. arXiv preprint arXiv:2404.07143.

  8. [8]

    RWKV-7 “Goose” with expressive dynamic state evolution, 2025

    Bo Peng et al. RWKV-7 "Goose" with expressive dynamic state evolution. arXiv preprint arXiv:2503.14456.

  9. [9]

    Linear transformers are secretly fast weight programmers, 2021

    Imanol Schlag, Kazuki Irie, and Jürgen Schmidhuber. Linear transformers are secretly fast weight programmers. arXiv preprint arXiv:2102.11174.

  10. [10]

    Learning to (Learn at Test Time): RNNs with Expressive Hidden States

    Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, Tatsunori Hashimoto, and Carlos Guestrin. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620.

  11. [11]

    Gated Linear Attention Transformers with Hardware-Efficient Training

    Songlin Yang, Bailin Wang, Yikang Shen, Rameswar Panda, and Yoon Kim. Gated linear attention transformers with hardware-efficient training. arXiv preprint arXiv:2312.06635.

  12. [12]

    Gated Delta Networks: Improving Mamba2 with Delta Rule

    Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024a.
    Songlin Yang, Bailin Wang, Yu Zhang, Yikang Shen, and Yoon Kim. Parallelizing linear transformers with the delta rule over sequence length. arXiv preprint arXiv:2406.06484, 2024b.