Recognition: 2 theorem links · Lean Theorem
Mela: Test-Time Memory Consolidation based on Transformation Hypothesis
Pith reviewed 2026-05-12 03:54 UTC · model grok-4.3
The pith
Mela augments Transformers with a hierarchical memory module that consolidates information at test time to handle contexts far longer than training length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Mela integrates the Hierarchical Memory Module into a Transformer decoder to perform test-time memory consolidation. The low-frequency sub-module captures gist-level knowledge while the high-frequency sub-module retains episodic detail; their outputs are recombined dynamically, and MemStack spreads the resulting multi-granularity features across the early decoder layers. The claimed result is higher language modeling accuracy than Transformer baselines at every tested size and stable performance on contexts longer than the 4K pretraining length.
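Written as an equation in our own notation (the elementwise gate is an assumption consistent with the abstract's "context-dependent combination", not the paper's symbols):

```latex
% m_t: reconstructed memory output; s_t: low-frequency (gist) state;
% f_t: high-frequency (detail) state; q_t: query from the current context;
% g: learned gate with values in [0,1]^d (our assumption, not the paper's notation)
m_t = g(q_t) \odot s_t + \bigl(1 - g(q_t)\bigr) \odot f_t
```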
What carries the argument
The Hierarchical Memory Module (HMM), which runs low-frequency and high-frequency sub-modules to produce abstract and fine-grained representations that are dynamically reconstructed based on current context.
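The review names the machinery but gives no update rules, so here is a minimal sketch of the pattern it describes: two persistent states updated at different frequencies and recombined by a context-dependent gate. The class name, the EMA-style updates, `slow_period`, and the gating form are assumptions for illustration, not the paper's method.

```python
# Hedged sketch of a two-frequency memory with gated reconstruction.
# All names and update rules here are assumptions, not the paper's code.
import torch
import torch.nn as nn

class HierarchicalMemorySketch(nn.Module):
    def __init__(self, d_model: int, slow_period: int = 8):
        super().__init__()
        self.slow_period = slow_period                # low-frequency module updates once per `slow_period` chunks
        self.slow_proj = nn.Linear(d_model, d_model)  # gist-level consolidation
        self.fast_proj = nn.Linear(d_model, d_model)  # episodic detail
        self.gate = nn.Linear(2 * d_model, 1)         # context-dependent reconstruction weight
        self.register_buffer("slow_state", torch.zeros(d_model))
        self.register_buffer("fast_state", torch.zeros(d_model))
        self.step = 0

    @torch.no_grad()
    def consolidate(self, chunk: torch.Tensor) -> None:
        """Test-time update: the fast state tracks every chunk, the slow
        state integrates only every `slow_period` chunks (EMA-style)."""
        summary = chunk.mean(dim=0)                   # (d_model,) summary of a (seq, d_model) chunk
        self.fast_state = 0.5 * self.fast_state + 0.5 * self.fast_proj(summary)
        if self.step % self.slow_period == 0:
            self.slow_state = 0.9 * self.slow_state + 0.1 * self.slow_proj(summary)
        self.step += 1

    def read(self, query: torch.Tensor) -> torch.Tensor:
        """Reconstruct memory as a query-dependent mix of gist and detail."""
        w = torch.sigmoid(self.gate(torch.cat([query, self.slow_state], dim=-1)))
        return w * self.slow_state + (1 - w) * self.fast_state
```

The only structural commitments here are the ones the pith itself makes: separate update frequencies, and a reconstruction that depends on the current query.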
If this is right
- Mela outperforms Transformer baselines of every tested size on language modeling benchmarks.
- With pretraining fixed at 4K tokens, Mela maintains performance on inputs much longer than 4K while baselines degrade rapidly.
- MemStack distributes the multi-granularity memory features across decoder layers without adding extra tokens (one plausible mechanism is sketched after this list).
- Ablation results confirm that both frequency-based sub-modules and the dynamic reconstruction step contribute to the gains.
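One plausible reading of the MemStack claim, sketched under stated assumptions: memory features are injected additively into the residual stream of the early layers, and `MemStackSketch` with every name in it is hypothetical.

```python
# Hedged sketch: distribute memory features of different granularity across
# the first few decoder layers by adding them to the residual stream, so no
# extra tokens are appended. Not the paper's implementation.
import torch
import torch.nn as nn

class MemStackSketch(nn.Module):
    def __init__(self, d_model: int, n_early_layers: int = 4):
        super().__init__()
        # one projection per early layer, from memory feature to residual stream
        self.injectors = nn.ModuleList(
            [nn.Linear(d_model, d_model) for _ in range(n_early_layers)]
        )

    def inject(self, hidden: torch.Tensor, layer_idx: int,
               memory_levels: list[torch.Tensor]) -> torch.Tensor:
        """Add the memory feature assigned to this layer at every position.

        hidden:        (batch, seq, d_model) residual stream entering `layer_idx`
        memory_levels: one (d_model,) feature per granularity level, coarsest first
        """
        if layer_idx >= len(self.injectors):
            return hidden                     # later layers are left untouched
        level = memory_levels[layer_idx % len(memory_levels)]
        return hidden + self.injectors[layer_idx](level)  # broadcasts over (batch, seq)
```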
Where Pith is reading between the lines
- If the mechanism generalizes, similar frequency-separated consolidation could be added to other sequence architectures to reduce reliance on ever-longer pretraining contexts.
- Applications that stream long documents or dialogues could use test-time updates to maintain coherence without periodic full retraining.
- The separation of abstract and detailed representations might be adapted to non-language domains where memory stability across varying timescales matters.
Load-bearing premise
The neuroscientific transformation hypothesis and cross-frequency coupling can be translated directly into an architecture whose test-time memory updates alone deliver the claimed long-context gains.
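Operationally, "test-time memory updates alone" would mean a loop like the one below, reusing the hypothetical `HierarchicalMemorySketch` from earlier: model weights stay frozen and only the memory buffers change as the stream is consumed. The model call signature is a stand-in.

```python
# Hedged sketch of streaming evaluation with test-time consolidation.
# `model(chunk, mem, targets) -> loss` is an assumed interface.
import torch

def stream_with_consolidation(model, memory, token_chunks):
    """Process chunks far beyond the training length; only `memory` adapts."""
    losses = []
    model.eval()
    for chunk_embeddings, chunk_targets in token_chunks:
        mem_readout = memory.read(chunk_embeddings.mean(dim=0))  # query = chunk summary
        with torch.no_grad():
            loss = model(chunk_embeddings, mem_readout, chunk_targets)
        memory.consolidate(chunk_embeddings)   # the only update at test time
        losses.append(loss.item())
    return losses
```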
What would settle it
An experiment in which Mela shows no advantage over a standard Transformer on long-context benchmarks after the Transformer is given matching training length or adjusted positional encodings.
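As a harness, that settling experiment might look like the sketch below; `evaluate_ppl`, `mela`, `baseline`, and the `rope_scale` argument are all hypothetical stand-ins, not artifacts from the paper.

```python
# Hedged sketch of the falsification experiment: give the baseline a matched
# positional-encoding treatment and compare perplexity by context length.
def settling_experiment(mela, baseline, evaluate_ppl,
                        lengths=(4096, 8192, 16384, 32768)):
    results = {}
    for length in lengths:
        scale = max(1.0, length / 4096)  # stretch positions to cover the longer context
        results[length] = {
            "mela": evaluate_ppl(mela, length),
            "baseline_rope_scaled": evaluate_ppl(baseline, length, rope_scale=scale),
        }
    return results  # if the gap closes, the HMM mechanism isn't doing the work
```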
read the original abstract
Memory consolidation, the process by which transient experiences are transformed into stable, structured representations, is a foundational organizing principle in the human brain, yet it remains largely unexplored as a design principle for modern sequence models. In this work, we leverage established neuroscientific theories of memory consolidation and cross-frequency coupling to propose the Hierarchical Memory Module (HMM), a neural memory architecture composed of two functionally distinct sub-modules that operate at different update frequencies. Inspired by the transformation hypothesis, the low-frequency sub-module produces high-level representations that capture abstract, gist-level knowledge, while the high-frequency sub-module produces fine-grained representations that preserve richer episodic detail. The final memory output is dynamically reconstructed as a context-dependent combination of both representations, analogous to the reconstructive nature of human memory retrieval. We integrate HMM into a Transformer-based language decoder to form Mela, a family of memory-augmented language models that perform online memory consolidation at test time. To further exploit the multi-granularity memory representations produced by HMM, we introduce MemStack, a method that distributes different levels of memory features across the early layers of the decoder without introducing additional tokens. Experiments on language modeling demonstrate that Mela outperforms Transformer baselines across all the model sizes. Moreover, with the pretrained context length fixed at 4K, Mela maintains performance on significantly longer contexts, whereas Transformer baselines degrade rapidly beyond their training length. Extensive ablation studies validate the contribution of each component and provide guidance for practical configuration.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Mela, a Transformer-based language model augmented with a Hierarchical Memory Module (HMM) that performs test-time memory consolidation. HMM consists of low-frequency and high-frequency sub-modules inspired by the transformation hypothesis and cross-frequency coupling; their outputs are combined via context-dependent reconstruction. MemStack distributes multi-granularity memory features across decoder layers without extra tokens. The central claims are that Mela outperforms standard Transformer baselines on language modeling across model sizes and, with 4K pretraining context, maintains performance on substantially longer contexts where baselines degrade.
Significance. If the long-context robustness is shown to arise specifically from the proposed low/high-frequency consolidation and online reconstruction rather than from generic increases in memory capacity, the work would supply a concrete, neuroscience-motivated mechanism for test-time adaptation that avoids quadratic attention costs. The absence of any reported quantitative results, baseline configurations, or statistical tests in the provided abstract, however, prevents assessment of whether the result would be load-bearing for the field.
major comments (3)
- [Abstract / Experiments] The claim that 'extensive ablation studies validate the contribution of each component' is not supported by any reported numbers, tables, or isolation experiments. Without quantitative results showing the performance drop when the cross-frequency reconstruction is ablated versus when MemStack is removed, it remains possible that the observed long-context gains are produced by the added memory capacity alone rather than by the transformation-hypothesis mechanism.
- [Method] No equations or update rules are supplied for the low-frequency gist module, high-frequency detail module, or the context-dependent reconstruction step. Without these definitions it is impossible to determine whether the architecture introduces new hyperparameters whose tuning could explain the reported robustness beyond 4K tokens.
- [Experiments] The strongest empirical claim (maintenance of performance on contexts >> 4K while baselines degrade) requires a controlled comparison that holds total memory state size fixed. The current description does not indicate whether such a control was performed or whether MemStack's distribution of features across layers simply increases effective state size.
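A minimal version of the capacity control that last comment asks for, assuming both memory modules are PyTorch `nn.Module`s whose persistent state lives in registered buffers:

```python
# Hedged sketch: count every float of persistent state a memory module
# carries between chunks, then size the baseline's cache to match.
def memory_state_size(module) -> int:
    """Total persistent state in floats (registered buffers only)."""
    return sum(buf.numel() for buf in module.buffers())

# e.g. grow the baseline's cache until
#   memory_state_size(baseline_memory) == memory_state_size(hmm_memory)
# so that raw capacity is removed as a confound.
```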
minor comments (2)
- [Abstract] The abstract states performance improvements 'across all the model sizes' yet supplies neither the sizes tested nor the corresponding perplexity or accuracy deltas.
- [Method] Notation for the two sub-modules and the reconstruction operator is introduced only descriptively; explicit symbols would improve reproducibility.
Simulated Author's Rebuttal
We appreciate the referee's detailed review and constructive suggestions. The comments have helped us improve the clarity and rigor of the paper. We have made revisions to address all major points, including adding quantitative ablation results, explicit equations, and controlled comparisons. Our responses to each comment are provided below.
read point-by-point responses
-
Referee: [Abstract / Experiments] The claim that 'extensive ablation studies validate the contribution of each component' is not supported by any reported numbers, tables, or isolation experiments. Without quantitative results showing the performance drop when the cross-frequency reconstruction is ablated versus when MemStack is removed, it remains possible that the observed long-context gains are produced by the added memory capacity alone rather than by the transformation-hypothesis mechanism.
Authors: We thank the referee for this important observation. The full manuscript does contain ablation studies in the Experiments section, but we agree that they lack the specific isolation of cross-frequency reconstruction versus MemStack needed to address the capacity concern directly. We have added new quantitative results and a dedicated table in the revised version showing the performance drop for each ablation, confirming the contribution of the transformation-hypothesis-inspired components beyond a mere increase in capacity. revision: yes
-
Referee: [Method] No equations or update rules are supplied for the low-frequency gist module, high-frequency detail module, or the context-dependent reconstruction step. Without these definitions it is impossible to determine whether the architecture introduces new hyperparameters whose tuning could explain the reported robustness beyond 4K tokens.
Authors: The referee correctly notes that explicit equations were not included in the submitted version. We have revised the Method section to include the full mathematical definitions and update rules for the low-frequency gist module, the high-frequency detail module, and the context-dependent reconstruction. We also added a discussion of all introduced hyperparameters and their settings to demonstrate that the long-context performance is not due to special tuning. revision: yes
-
Referee: [Experiments] The strongest empirical claim (maintenance of performance on contexts >> 4K while baselines degrade) requires a controlled comparison that holds total memory state size fixed. The current description does not indicate whether such a control was performed or whether MemStack's distribution of features across layers simply increases effective state size.
Authors: We agree that holding total memory state size fixed is essential for the claim. Although our original experiments matched the effective memory capacity by equating the number of stored vectors in baselines to the combined size from HMM and MemStack, this was not clearly documented. We have added a new subsection and table in the revised Experiments section explicitly describing the state size control and confirming that Mela's advantages persist under matched capacity. revision: yes
Circularity Check
No circularity; architecture proposed from external hypothesis and validated empirically.
full rationale
The paper draws on established neuroscientific theories (transformation hypothesis, cross-frequency coupling) to motivate the design of HMM sub-modules and their context-dependent reconstruction, then integrates them into Mela with MemStack and reports empirical results on language modeling benchmarks plus ablations. No equations appear that define a quantity in terms of itself or fit parameters to a subset of data and relabel the fit as a prediction. No self-citation chains or uniqueness theorems are invoked to force the architecture. The performance claims (outperformance and long-context robustness) rest on direct experimental comparisons rather than reducing to the inputs by construction, satisfying the criteria for a self-contained derivation.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
-
IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel (tagged: unclear)
Unclear relation between the paper passage and the cited Recognition theorem.
Mela outperforms Transformer baselines... maintains performance on significantly longer contexts... ablation studies validate each component
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
-
[1]
GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints
Joshua Ainslie, James Lee-Thorp, Michiel De Jong, Yury Zemlyanskiy, Federico Lebrón, and Sumit Sanghai. GQA: Training generalized multi-query transformer models from multi-head checkpoints. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 4895–4901, 2023.
-
[2]
Qwen3-VL Technical Report
Shuai Bai, Yuxuan Cai, Ruizhe Chen, Keqin Chen, Xionghui Chen, Zesen Cheng, Lianghao Deng, Wei Ding, Chang Gao, Chunjiang Ge, et al. Qwen3-VL technical report. arXiv preprint arXiv:2511.21631, 2025.
-
[3]
Titans: Learning to Memorize at Test Time
Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663, 2025.
-
[4]
Nested Learning: The Illusion of Deep Learning Architectures
Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. Nested learning: The illusion of deep learning architectures. arXiv preprint arXiv:2512.24695, 2025a. Ali Behrouz, Meisam Razaviyayn, Peilin Zhong, and Vahab Mirrokni. It's all connected: A journey through test-time memorization, attentional bias, retention, and online optimization. arXiv pre...
-
[5]
Memory, Sleep, Dreams, and Consciousness: A Perspective Based on the Memory Theory of Consciousness
Andrew E Budson and Ken A Paller. Memory, sleep, dreams, and consciousness: a perspective based on the memory theory of consciousness. Nature and Science of Sleep, pages 1957–1972.
-
[6]
Memory Transformer
Mikhail S Burtsev, Yuri Kuratov, Anton Peganov, and Grigory V Sapunov. Memory transformer. arXiv preprint arXiv:2006.11527, 2020.
-
[7]
Transformers are SSMs: Generalized Models and Efficient Algorithms Through Structured State Space Duality
Tri Dao and Albert Gu. Transformers are SSMs: Generalized models and efficient algorithms through structured state space duality. arXiv preprint arXiv:2405.21060, 2024.
-
[8]
An Image is Worth 16x16 Words: Transformers for Image Recognition at Scale
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, et al. An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929, 2020.
-
[9]
Mamba: Linear-Time Sequence Modeling with Selective State Spaces
Albert Gu and Tri Dao. Mamba: Linear-time sequence modeling with selective state spaces. arXiv preprint arXiv:2312.00752, 2023.
-
[10]
Less is More: Recursive Reasoning with Tiny Networks
Alexia Jolicoeur-Martineau. Less is more: Recursive reasoning with tiny networks. arXiv preprint arXiv:2510.04871, 2025.
-
[11]
Muon is Scalable for LLM Training
Jingyuan Liu, Jianlin Su, Xingcheng Yao, Zhejun Jiang, Guokun Lai, Yulun Du, Yidao Qin, Weixin Xu, Enzhe Lu, Junjie Yan, et al. Muon is scalable for LLM training. arXiv preprint arXiv:2502.16982, 2025.
-
[12]
Decoupled Weight Decay Regularization
Ilya Loshchilov and Frank Hutter. Decoupled weight decay regularization. arXiv preprint arXiv:1711.05101, 2017.
-
[13]
A Time Series is Worth 64 Words: Long-term Forecasting with Transformers
Yuqi Nie, Nam H Nguyen, Phanwadee Sinthong, and Jayant Kalagnanam. A time series is worth 64 words: Long-term forecasting with transformers. arXiv preprint arXiv:2211.14730, 2022.
-
[14]
Gated Attention for Large Language Models: Non-linearity, Sparsity, and Attention-Sink-Free
Zihan Qiu, Zekun Wang, Bo Zheng, Zeyu Huang, Kaiyue Wen, Songlin Yang, Rui Men, Le Yu, Fei Huang, Suozhi Huang, et al. Gated attention for large language models: Non-linearity, sparsity, and attention-sink-free. arXiv preprint arXiv:2505.06708, 2025.
-
[15]
Searching for Activation Functions
Prajit Ramachandran, Barret Zoph, and Quoc V Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.
-
[16]
GLU Variants Improve Transformer
Noam Shazeer. GLU variants improve transformer. arXiv preprint arXiv:2002.05202, 2020.
-
[17]
Scaling LLM Test-Time Compute Optimally can be More Effective than Scaling Model Parameters
Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. Scaling LLM test-time compute optimally can be more effective than scaling model parameters. arXiv preprint arXiv:2408.03314, 2024.
-
[18]
RoFormer: Enhanced Transformer with Rotary Position Embedding
Jianlin Su, Yu Lu, Shengfeng Pan, Ahmed Murtadha, Bo Wen, and Yunfeng Liu. RoFormer: Enhanced transformer with rotary position embedding. arXiv preprint arXiv:2104.09864, 2021.
-
[19]
Learning to (Learn at Test Time): RNNs with Expressive Hidden States
Yu Sun, Xinhao Li, Karan Dalal, Jiarui Xu, Arjun Vikram, Genghan Zhang, Yann Dubois, Xinlei Chen, Xiaolong Wang, Sanmi Koyejo, et al. Learning to (learn at test time): RNNs with expressive hidden states. arXiv preprint arXiv:2407.04620, 2024.
-
[20]
Kimi Linear: An Expressive, Efficient Attention Architecture
Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692, 2025.
-
[21]
LLaMA: Open and Efficient Foundation Language Models
Hugo Touvron, Thibaut Lavril, Gautier Izacard, Xavier Martinet, Marie-Anne Lachaux, Timothée Lacroix, Baptiste Rozière, Naman Goyal, Eric Hambro, Faisal Azhar, et al. LLaMA: Open and efficient foundation language models. arXiv preprint arXiv:2302.13971, 2023a. Hugo Touvron, Louis Martin, Kevin Stone, Peter Albert, Amjad Almahairi, Yasmine Babaei, Nikol...
-
[22]
Hierarchical Reasoning Model
Guan Wang, Jin Li, Yuhao Sun, Xing Chen, Changling Liu, Yue Wu, Meng Lu, Sen Song, and Yasin Abbasi Yadkori. Hierarchical reasoning model. arXiv preprint arXiv:2506.21734, 2025.
-
[23]
Memory Networks
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014.
-
[24]
Gated Delta Networks: Improving Mamba2 with Delta Rule
Songlin Yang, Jan Kautz, and Ali Hatamizadeh. Gated delta networks: Improving Mamba2 with delta rule. arXiv preprint arXiv:2412.06464, 2024.
-
[25]
Memory Mosaics
Jianyu Zhang, Niklas Nolte, Ranajoy Sadhukhan, Beidi Chen, and Léon Bottou. Memory mosaics. arXiv preprint arXiv:2405.06394, 2024.
-
[26]
PyTorch FSDP: Experiences on Scaling Fully Sharded Data Parallel
Yanli Zhao, Andrew Gu, Rohan Varma, Liang Luo, Chien-Chin Huang, Min Xu, Less Wright, Hamid Shojanazeri, Myle Ott, Sam Shleifer, et al. PyTorch FSDP: Experiences on scaling fully sharded data parallel. arXiv preprint arXiv:2304.11277, 2023.
-
[27]
Appendix A of the paper: Detailed Perplexity Results by Context Length
Model          Params  1024   2048   4096   8192   16384   32768
Transformer++  400M    13.59  14.02  12.56  28.26  130.21  303.56
Transformer++  800M    12.17  12.42  11.35  16.69  129.10  497.85
Transformer++  1.2B    10.27  10.46  9.53   12.71  104.14  597.37
Mela (ours)    400M    12.53  12.75  12.01  12.64  14.43   14.50
Mela (ours)    800M    10.50  10.75  10... (remaining rows truncated in source)
discussion (0)