Repeated Shared Access Enables Grokking, but Edit Propagation Depends on an Addressable Memory

Yanan Niu

arxiv: 2606.20737 · v2 · pith:CX4BHLPBnew · submitted 2026-06-17 · 💻 cs.AI · cs.LG

Pith reviewed 2026-06-26 20:41 UTC · model grok-4.3

classification 💻 cs.AI cs.LG

keywords grokking, factual editing, edit propagation, knowledge graph QA, transformer memory, looped transformers, shared memory access

The pith

Addressable memory enables factual edit propagation while repeated access enables grokking regardless of mechanism.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper examines whether the same architectural features support both out-of-distribution generalization and factual edit propagation in transformers. It runs a 2x2 experiment on a synthetic knowledge-graph question-answering task that crosses loop recurrence with the presence of shared memory. Both looping and memory access allow models to cross the grokking threshold on unseen cases, unlike a plain dense transformer. Edit propagation, however, occurs reliably only when an addressable memory site is present; models without it propagate only weakly even when they loop (Loop 0.04-0.30, Dense 0.00-0.03). The results separate the substrate needed for learning from the substrate needed for editing.

Core claim

Repeated shared access, achieved either by looped recomputation or by repeated memory rereading, suffices to cross the out-of-distribution grokking barrier, whereas edit propagation after a single localized factual edit requires an addressable memory site that the forward pass can write to and later reread; every memory-bearing model exceeds every memory-free model, and a non-recurrent dense model with memory still propagates edits at high rates.

What carries the argument

The 2x2 grid of architectures (dense, looped, dense-plus-memory, looped-plus-memory) that isolates repeated shared access from the presence of an explicit addressable memory store.
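As a rough illustration of how the four cells differ, the toy sketch below encodes the grid as two boolean flags and runs a stand-in forward pass. It is not the paper's implementation: `Cell`, `read_memory`, `block`, and `forward` are hypothetical names, the `tanh` block stands in for a real transformer layer, every cell uses four steps for simplicity, and rereading the memory at every non-final step in both memory cells is an assumption modeled on the Figure 1 caption.

```python
import math
from dataclasses import dataclass
from typing import List

Vector = List[float]


@dataclass(frozen=True)
class Cell:
    name: str
    looped: bool   # tied-weight backbone reused at every step
    memory: bool   # forward pass rereads a shared addressable memory


GRID = [
    Cell("Dense", looped=False, memory=False),
    Cell("Loop", looped=True, memory=False),
    Cell("Dense+Mem", looped=False, memory=True),
    Cell("LMC", looped=True, memory=True),
]


def read_memory(h: Vector, keys: List[Vector], values: List[Vector]) -> Vector:
    """Soft lookup: softmax over query/key dot products, weighted sum of values."""
    scores = [sum(a * b for a, b in zip(h, k)) for k in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]
    total = sum(weights)
    return [
        sum(w / total * v[d] for w, v in zip(weights, values))
        for d in range(len(h))
    ]


def block(h: Vector, scale: float) -> Vector:
    """Toy stand-in for a transformer block (assumption, not a real layer)."""
    return [math.tanh(scale * x) for x in h]


def forward(cell: Cell, h: Vector, keys: List[Vector], values: List[Vector],
            steps: int = 4) -> Vector:
    # Looped cells reuse one set of "weights"; dense cells get a distinct one per step.
    scales = [1.0] * steps if cell.looped else [1.0 + 0.1 * i for i in range(steps)]
    for i, scale in enumerate(scales):
        h = block(h, scale)
        if cell.memory and i < steps - 1:  # no memory read on the final step
            retrieved = read_memory(h, keys, values)
            h = [a + b for a, b in zip(h, retrieved)]  # residual add
    return h


h0 = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0]]
values = [[0.5, 0.0], [0.0, 0.5]]
edited = [[-0.5, 0.0], [0.0, 0.5]]  # overwrite a single memory slot

for cell in GRID:
    changed = forward(cell, h0, keys, values) != forward(cell, h0, keys, edited)
    print(f"{cell.name:10s} output changes after memory edit: {changed}")
```

In this toy, overwriting one memory slot changes the output of Dense+Mem and LMC and leaves Dense and Loop untouched. That is only the structural half of the claim; whether a trained model actually propagates an edit to 2-hop answers is the empirical question the paper measures.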

Load-bearing premise

The synthetic knowledge-graph QA task and the 2x2 controls isolate memory access from recurrence without confounding effects from scale or task structure.

What would settle it

Finding strong edit propagation after a single factual edit in a memory-free looped model on the same task or a closely matched one would falsify the necessity of addressable memory.

Figures

Figures reproduced from arXiv: 2606.20737 by Yanan Niu.

**Figure 1.** Loop-Memory Coupling architecture. A thin shared backbone (L=3 layers) is iterated R=4 times with tied weights fθ, giving 12 effective layers. At each non-final iteration the latent state queries the same shared memory Memϕ and adds the retrieval by a residual connection; the final iteration is a pure transformer finish before the LM head. Removing the memory recovers the pure looped backbone (Loop); repla…

**Figure 2.** OOD grokking as a binary onset event. Following the binary view of grokking in Wang et al. [1], we plot only whether each configuration crosses the conservative above-chance threshold and, when it does, the first step at which held-out two-hop OOD accuracy reaches 0.1. Bars show five-seed mean onset steps; dots show individual seeds; error bars show seed standard deviations. The Dense 12L anchor never cros…
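The onset definition in the Figure 2 caption (first step at which held-out two-hop OOD accuracy reaches 0.1, then a mean and standard deviation over seeds) can be sketched as follows. The function names and the accuracy curves are hypothetical placeholders rather than data from the paper, and the use of the sample standard deviation is an assumption, since the caption does not say which estimator was used.

```python
from statistics import mean, stdev
from typing import Dict, List, Optional, Sequence, Tuple

Curve = Sequence[Tuple[int, float]]  # (training step, held-out OOD accuracy)


def onset_step(curve: Curve, threshold: float = 0.1) -> Optional[int]:
    """Return the first logged step whose accuracy reaches the threshold, else None."""
    for step, accuracy in curve:
        if accuracy >= threshold:
            return step
    return None


def summarize_onsets(onsets: List[Optional[int]]) -> Dict[str, Optional[float]]:
    """Binary view: how many seeds crossed, plus mean/std onset over those that did."""
    crossed = [s for s in onsets if s is not None]
    return {
        "n_crossed": len(crossed),
        "mean_onset": mean(crossed) if crossed else None,
        "std_onset": stdev(crossed) if len(crossed) > 1 else None,
    }


# Hypothetical curves, for illustration only.
toy_curves = [
    [(0, 0.00), (100, 0.05), (200, 0.12), (300, 0.40)],
    [(0, 0.00), (100, 0.02), (200, 0.08), (300, 0.15)],
    [(0, 0.00), (100, 0.01), (200, 0.02), (300, 0.03)],  # never crosses
]
toy_onsets = [onset_step(c) for c in toy_curves]
print(toy_onsets, summarize_onsets(toy_onsets))
```

A seed that never reaches the threshold yields `None` and is counted as not crossing instead of being averaged in, which matches the binary framing in the caption.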

**Figure 3.** Lesions separate factual recall from compositional use.

**Figure 4.** Edit-propagation ladder by substrate. Bars show five-seed means of the strong-propagation metric; dots show individual seeds. Colors follow the Study 1 substrate palette, and hatching marks memory-bearing cells. The grey band is empty: every memory-bearing seed is at least 0.713, while every memory-free seed is at most 0.297. Thus the main split is along the memory axis, not the recurrence axis; Loop is a …
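The "empty grey band" statement in the Figure 4 caption amounts to a simple check: the lowest memory-bearing seed must sit above the highest memory-free seed. A minimal sketch of that check follows. `separation_gap` is a hypothetical helper, and only the two extremes quoted in the caption (0.713 and 0.297) are used, because the individual seed values are not given here.

```python
from typing import Sequence


def separation_gap(memory_bearing: Sequence[float], memory_free: Sequence[float]) -> float:
    """Width of the empty band; a positive value means every memory-bearing
    seed exceeds every memory-free seed."""
    return min(memory_bearing) - max(memory_free)


# Extremes quoted in the Figure 4 caption; per-seed values would replace these lists.
lowest_memory_bearing = 0.713
highest_memory_free = 0.297

gap = separation_gap([lowest_memory_bearing], [highest_memory_free])
print(f"empty band width: {gap:.3f}, cleanly separated: {gap > 0}")
```

With the full per-seed lists, the same function would confirm or refute the "every exceeds every" claim directly, since it depends only on the two extremes.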

Original abstract

We study factual edit propagation in a controlled synthetic knowledge-graph QA setting using a 2x2 grid that crosses loop recurrence with shared-memory access: a dense transformer (Dense), a looped transformer (Loop), a dense backbone with shared memory (Dense+Mem), and a looped backbone with shared memory (loop-memory coupling, LMC). The two factors dissociate. For learning, both routes to repeated shared access -- looped recomputation and repeated memory rereading -- cross the out-of-distribution (OOD) grokking barrier that Dense fails, so repeated shared access is the behavioral regularity, not a specific architecture. For editing, the substrates split along a different axis: applying a single localized factual edit (conditioned on direct success) and measuring 2-hop propagation on a shared pre-edit-correct set, the edit propagates strongly in both memory-bearing cells (LMC 0.78-0.92, Dense+Mem 0.71-0.96) and only weakly in the memory-free ones (Loop 0.04-0.30, Dense 0.00-0.03). The split is along the memory axis, not the loop axis: every memory-bearing seed exceeds every memory-free seed, with no detectable difference between the two memory cells. Crucially Dense+Mem has no recurrence, so the propagating ingredient is an addressable site that an edit can write to and later computation rereads, not loop recomputation; Loop is at best a partial intermediate. The affordance survives coarsening the store (N=128 to N=13): propagation attenuates but the memory/no-memory split persists, so fine granularity buys precision rather than the affordance itself. These results dissociate learning competence from editing affordance -- repeated shared access suffices to grok, but edit propagation depends on whether the substrate exposes an addressable memory that the forward computation can write to and later reread, an affordance that loop recurrence provides only partially.
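One way to read the evaluation protocol in the abstract is as a conditional rate: among 2-hop queries that were answered correctly before the edit and whose direct edit succeeded, count how often the 2-hop answer follows the edited fact. The sketch below assumes that reading. `EditTrial`, its field names, and the toy trials are hypothetical, and the paper's exact "strong-propagation" definition may differ.

```python
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass(frozen=True)
class EditTrial:
    pre_edit_correct: bool  # 2-hop query is in the shared pre-edit-correct set
    direct_success: bool    # the edited 1-hop fact is recalled as the new target
    two_hop_updated: bool   # the 2-hop answer now follows the edited fact


def propagation_rate(trials: Iterable[EditTrial]) -> Optional[float]:
    """2-hop propagation, conditioned on pre-edit correctness and direct edit success."""
    eligible = [t for t in trials if t.pre_edit_correct and t.direct_success]
    if not eligible:
        return None  # nothing to condition on
    return sum(t.two_hop_updated for t in eligible) / len(eligible)


# Hypothetical trials, for illustration only.
toy_trials = [
    EditTrial(pre_edit_correct=True, direct_success=True, two_hop_updated=True),
    EditTrial(pre_edit_correct=True, direct_success=True, two_hop_updated=False),
    EditTrial(pre_edit_correct=True, direct_success=True, two_hop_updated=True),
    EditTrial(pre_edit_correct=True, direct_success=False, two_hop_updated=False),
]
print(propagation_rate(toy_trials))
```

Conditioning on direct success keeps a failed edit from being counted as a failed propagation, and restricting to a shared pre-edit-correct set keeps the denominator comparable across the four architectures.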

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The 2x2 cleanly shows repeated shared access drives grokking while addressable memory drives edit propagation, with the split holding across seeds.

The paper's main finding is that grokking on this synthetic KG-QA task occurs whenever the model gets repeated access to the same information, whether that comes from a loop or from a shared memory store. Edit propagation, however, requires an addressable memory site that can be written once and reread later; recurrence alone does not suffice. Dense+Mem succeeds at propagation without any looping, while Loop stays close to plain Dense.

The 2x2 grid is the useful part. It crosses the two factors directly and reports that every memory-bearing seed beats every memory-free seed on the 2-hop propagation metric, with LMC at 0.78-0.92 and Dense+Mem at 0.71-0.96 against Loop at 0.04-0.30 and Dense at 0.00-0.03. The coarsening control (N=128 to N=13) keeps the split while lowering absolute numbers, which suggests the affordance is real rather than an artifact of fine-grained storage.

The synthetic task and the conditioning on direct edit success are reasonable controls for isolating the architectural variable. The results line up with the stated claim that the propagating ingredient is the write/read site, not recomputation.

A soft spot is that the abstract gives ranges rather than per-seed values or variance, so the "every exceeds every" statement would be easier to assess with the raw distributions. That said, the stress-test indicates the pattern is consistent and no obvious confound appears in the design.

This is worth a serious referee. The dissociation is concrete and directly relevant to mechanistic work on editing. Readers working on model editing or interpretability would get value from the controlled comparison.

Referee Report

0 major / 3 minor

Summary. The manuscript claims that repeated shared access enables grokking on a synthetic knowledge-graph QA task regardless of whether it arises from loop recurrence or memory rereading, but factual edit propagation requires an addressable memory site that edits can write to and later computation can reread. This is shown via a 2x2 grid (Dense, Loop, Dense+Mem, LMC) where grokking occurs in both repeated-access routes while edit propagation splits cleanly along the memory axis (memory cells 0.71-0.96 vs memory-free 0.00-0.30), with every memory-bearing seed exceeding every memory-free seed and the split persisting after coarsening the memory store from N=128 to N=13.

Significance. If the experimental controls isolate the factors as described, the dissociation between grokking competence and edit-propagation affordance is a useful empirical result for understanding transformer editing. The design's strengths are the clean separation, the observation that Dense+Mem succeeds without recurrence, and the persistence of the memory/no-memory split under coarsening; these elements make the central claim falsifiable and directly testable.

minor comments (3)

[Abstract] Abstract and results: propagation rates are reported as ranges without accompanying error bars, seed counts, or statistical tests; adding these would allow readers to assess the reliability of the claim that every memory-bearing seed exceeds every memory-free seed.
The description of the synthetic KG-QA task and the precise definition of 2-hop propagation (including how the shared pre-edit-correct set is constructed) should be expanded in the methods to support replication and to confirm absence of confounds from task structure.
A figure or table presenting the 2x2 grid outcomes would benefit from explicit labeling of the four cells and the coarsening control to make the memory-axis split visually immediate.
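On the request for a statistical test in the first comment, complete separation has a simple exact reading. A minimal sketch, assuming five seeds per cell (as the Figure 4 caption states), pooling the two memory-bearing cells against the two memory-free cells, no ties, and exchangeable seeds under the null hypothesis: only one of the equally likely ways to assign the "memory-bearing" label puts all of those seeds on top, so the one-sided exact permutation p-value is one over the number of labelings. `complete_separation_p` is a hypothetical helper, not something the paper reports.

```python
from math import comb


def complete_separation_p(n_high: int, n_low: int) -> float:
    """One-sided exact permutation p-value when every value in the first
    group exceeds every value in the second (no ties assumed)."""
    return 1 / comb(n_high + n_low, n_high)


# Pooled comparison: 2 cells x 5 seeds on each side of the memory axis.
pooled = complete_separation_p(10, 10)

# Single-cell comparison, e.g. one memory-bearing cell vs one memory-free cell.
per_cell = complete_separation_p(5, 5)

print(f"pooled 10 vs 10: p = {pooled:.2e}")
print(f"single 5 vs 5:   p = {per_cell:.4f}")
```

Pooling treats seeds from different architectures as exchangeable within each side, which is a modeling choice; the per-cell figure is the more conservative reading.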

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation, accurate summary of the 2x2 ablation results, and recommendation for minor revision. The design's ability to dissociate repeated shared access (for grokking) from addressable memory (for edit propagation) was correctly identified as the central contribution.

Circularity Check

0 steps flagged

No significant circularity

The paper is strictly empirical, reporting measured performance differences across a 2x2 experimental grid (Dense, Loop, Dense+Mem, LMC) on a synthetic KG-QA task. No equations, derivations, fitted parameters, or self-citations appear in the provided text; claims rest on direct comparisons (e.g., memory-bearing cells 0.71-0.96 vs memory-free 0.00-0.30 for propagation) without any reduction of outputs to inputs by construction. The design isolates factors via controlled variants, making the central dissociation self-contained against external benchmarks rather than internally referential.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The paper is purely empirical; it introduces no free parameters, axioms, or invented entities beyond standard transformer components and the synthetic task definition.

Reference graph

Works this paper leans on

32 extracted references · 11 linked inside Pith

[1] Boshi Wang, Xiang Yue, Yu Su, and Huan Sun. Grokked transformers are implicit reasoners: A mechanistic journey to the edge of generalization. Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2405.15071

[2] Harsh Kohli, Srinivasan Parthasarathy, Huan Sun, and Yuekun Yao. Loop, think, & generalize: Implicit reasoning in recurrent-depth transformers. arXiv preprint arXiv:2604.07822, 2026. The Ohio State University; concurrent work, same KG-reasoning task as Wang 2024

[3] Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2202.05262 (ROME)

[4] Xilin Wei, Xiaoran Liu, Yuhang Zang, Xiaoyi Dong, Yuhang Cao, Jiaqi Wang, Xipeng Qiu, and Dahua Lin. Sim-cot: Supervised implicit chain-of-thought. arXiv preprint arXiv:2509.20317, 2025

[5] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Łukasz Kaiser. Universal transformers. In International Conference on Learning Representations (ICLR), 2019

[6] Jonas Geiping, Sean McLeish, Neel Jain, John Kirchenbauer, Siddharth Singh, Brian R. Bartoldson, Bhavya Kailkhura, Abhinav Bhatele, and Tom Goldstein. Scaling up test-time compute with latent reasoning: A recurrent depth approach. arXiv preprint arXiv:2502.05171, 2025. Huginn-3.5B

[7] Rui-Jie Zhu, Zixuan Wang, Kai Hua, Tianyu Zhang, Ziniu Li, Haoran Que, Boyi Wei, Zixin Wen, Fan Yin, He Xing, Lu Li, Jiajun Shi, Kaijing Ma, Shanda Li, Taylor Kergan, Andrew Smith, Xingwei Qu, Mude Hui, Bohong Wu, Qiyang Min, Hongzhi Huang, Xun Zhou, Wei Ye, Jiaheng Liu, Jian Yang, Yunfeng Shi, Chenghua Lin, Enduo Zhao, Tianle Cai, Ge Zhang, Wenhao Huang, ... Scaling latent reasoning via looped language models. arXiv preprint arXiv:2510.25741, 2025

[8] Ying Fan, Yilun Du, Kannan Ramchandran, and Kangwook Lee. Looped transformers for length generalization. In International Conference on Learning Representations (ICLR), 2025. arXiv:2409.15647

[9] Nikunj Saunshi, Nishanth Dikkala, Zhiyuan Li, Sanjiv Kumar, and Sashank J. Reddi. Reasoning with latent thoughts: On the power of looped transformers. In International Conference on Learning Representations (ICLR), 2025. arXiv:2502.17416

[10] Zhenzhong Lan, Mingda Chen, Sebastian Goodman, Kevin Gimpel, Piyush Sharma, and Radu Soricut. ALBERT: A lite BERT for self-supervised learning of language representations. In International Conference on Learning Representations (ICLR), 2020. arXiv:1909.11942 (cross-layer parameter sharing)

[11] Sho Takase and Shun Kiyono. Lessons on parameter sharing across layers in transformers. arXiv preprint arXiv:2104.06022, 2021

[12] Róbert Csordás, Kazuki Irie, Jürgen Schmidhuber, Christopher Potts, and Christopher D. Manning. MoEUT: Mixture-of-experts universal transformers. In Advances in Neural Information Processing Systems (NeurIPS), 2024. arXiv:2405.16039

[13] Amirkeivan Mohtashami, Matteo Pagliardini, and Martin Jaggi. CoTFormer: A chain-of-thought driven architecture with budget-adaptive computation cost at inference. In International Conference on Learning Representations (ICLR), 2025. arXiv:2310.10845

[14] Alethea Power, Yuri Burda, Harri Edwards, Igor Babuschkin, and Vedant Misra. Grokking: Generalization beyond overfitting on small algorithmic datasets. arXiv preprint arXiv:2201.02177, 2022

[15] Neel Nanda, Lawrence Chan, Tom Lieberum, Jess Smith, and Jacob Steinhardt. Progress measures for grokking via mechanistic interpretability. In International Conference on Learning Representations (ICLR), 2023

[16] Vikrant Varma, Rohin Shah, Zachary Kenton, János Kramár, and Ramana Kumar. Explaining grokking through circuit efficiency. arXiv preprint arXiv:2309.02390, 2023

[17] Alex Graves, Greg Wayne, and Ivo Danihelka. Neural turing machines. arXiv preprint arXiv:1410.5401, 2014

[18] Alex Graves, Greg Wayne, Malcolm Reynolds, Tim Harley, Ivo Danihelka, Agnieszka Grabska-Barwińska, Sergio Gomez Colmenarejo, Edward Grefenstette, Tiago Ramalho, John Agapiou, Adria Puigdomenech Badia, Karl Moritz Hermann, Yori Zwols, Georg Ostrovski, Adam Cain, Helen King, Christopher Summerfield, Phil Blunsom, Koray Kavukcuoglu, and Demis Hassabis. Hybrid computing using a neural network with dynamic external memory. Nature, 538(7626): 471–476, 2016

[19] Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. In International Conference on Learning Representations (ICLR), 2015. arXiv:1410.3916

[20] Sainbayar Sukhbaatar, Arthur Szlam, Jason Weston, and Rob Fergus. End-to-end memory networks. In Advances in Neural Information Processing Systems (NeurIPS), 2015

[21] Aydar Bulatov, Yuri Kuratov, and Mikhail S. Burtsev. Recurrent memory transformer. In Advances in Neural Information Processing Systems (NeurIPS), 2022. arXiv:2207.06881

[22] Yuhuai Wu, Markus N. Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. In International Conference on Learning Representations (ICLR), 2022. arXiv:2203.08913

[23] Vincent-Pierre Berges, Barlas Oğuz, Daniel Haziza, Wen-tau Yih, Luke Zettlemoyer, and Gargi Ghosh. Memory layers at scale. arXiv preprint arXiv:2412.09764, 2024

[24] Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, and Zhouhan Lin. MLP memory: A retriever-pretrained memory for large language models. arXiv preprint arXiv:2508.01832, 2025

[25] Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2025

[26] Jikun Kang, Wenqi Wu, Filippos Christianos, Alex J. Chan, Fraser Greenlee, George Thomas, Marvin Purtorab, and Andy Toulis. LM2: Large memory models. arXiv preprint arXiv:2502.06049, 2025

[27] Shangyi Geng, Wenting Zhao, and Alexander M. Rush. Great memory, shallow reasoning: Limits of kNN-LMs. In Proceedings of NAACL (Short Papers), 2025. arXiv:2408.11815; first posted 2024

[28] Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Fast model editing at scale. In International Conference on Learning Representations (ICLR), 2022. arXiv:2110.11309 (MEND)

[29] Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. In International Conference on Learning Representations (ICLR), 2023. arXiv:2210.07229 (MEMIT)

[30] Zexuan Zhong, Zhengxuan Wu, Christopher D. Manning, Christopher Potts, and Danqi Chen. MQuAKE: Assessing knowledge editing in language models via multi-hop questions. arXiv preprint arXiv:2305.14795, 2023

[31] Markus Frey, Behzad Shomali, Ali Hamza Bashir, David Berghaus, Joachim Koehler, and Mehdi Ali. Adaptive loops and memory in transformers: Think harder or know more? arXiv preprint arXiv:2603.08391, 2026. Latent & Implicit Thinking Workshop @ ICLR 2026

[32] Grigory Sapunov. Universal transformers need memory: Depth-state trade-offs in adaptive recursive reasoning. arXiv preprint arXiv:2604.21999, 2026
