pith. machine review for the scientific record.

arxiv: 2605.12357 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: 2 Lean theorem links

δ-mem: Efficient Online Memory for Large Language Models

Authors on Pith · no claims yet

Pith reviewed 2026-05-13 04:01 UTC · model grok-4.3

classification 💻 cs.AI
keywords delta-mem · online memory · large language models · delta rule · attention correction · long-term memory · associative memory · frozen backbone

The pith

An 8×8 state matrix updated by the delta rule supplies effective long-term memory to frozen language models by generating low-rank corrections to their attention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that large language models can accumulate and reuse historical information through a compact online memory state rather than by expanding context windows or retraining the entire model. δ-mem maintains a fixed 8×8 associative matrix that compresses past tokens via delta-rule updates and reads it out to adjust the frozen backbone's attention scores during generation. A sympathetic reader would care because expanding context is computationally expensive and often ineffective, while full fine-tuning or model replacement is resource-heavy; this approach keeps the core model untouched and the memory tiny, yet yields measurable gains. The gains are especially pronounced on tasks that demand recall of earlier events, while general capabilities stay mostly intact.

Core claim

δ-mem augments a frozen full-attention backbone with a compact online state of associative memory that compresses past information into a fixed-size state matrix updated by delta-rule learning; its readout generates low-rank corrections to the backbone's attention computation, raising the average score to 1.10× that of the frozen backbone and 1.15× that of the strongest non-δ-mem baseline, with larger improvements of 1.31× on MemoryAgentBench and 1.20× on LoCoMo.

What carries the argument

The δ-mem state matrix: a fixed-size associative memory updated by delta-rule learning whose readout supplies low-rank corrections directly to the frozen backbone's attention computation.
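To make that mechanism concrete, here is a minimal NumPy sketch of a δ-mem-style loop. The 8×8 state and the outer-product delta-rule update follow the paper's description; the projections W_k/W_v, the scalar decay and step size, and the exact point where the low-rank correction enters the attention logits are assumptions, since the abstract does not specify them.

```python
import numpy as np

D_MEM = 8      # side of the fixed 8x8 memory state (from the paper)
D_MODEL = 64   # hidden size of the frozen backbone (illustrative only)

rng = np.random.default_rng(0)
# Assumed learned projections from backbone hidden states into the memory space.
W_k = rng.normal(scale=D_MODEL ** -0.5, size=(D_MODEL, D_MEM))
W_v = rng.normal(scale=D_MODEL ** -0.5, size=(D_MODEL, D_MEM))

def delta_update(S, h_t, lam=0.99, beta=0.1):
    """One delta-rule step: S_t = lam * S_{t-1} + beta * (v_t - S_{t-1} k_t) k_t^T."""
    k = h_t @ W_k
    v = h_t @ W_v
    err = v - S @ k                      # prediction error (the "delta")
    return lam * S + beta * np.outer(err, k)

def attention_with_memory(q, keys, S):
    """Standard dot-product attention logits plus an assumed correction,
    rank at most 8, read out from the memory state S."""
    logits = keys @ q / np.sqrt(q.shape[0])
    correction = (keys @ W_k) @ (S.T @ (q @ W_k))   # low-rank bias from memory
    logits = logits + correction
    w = np.exp(logits - logits.max())
    return w / w.sum()

# Toy usage: stream a history of hidden states through the memory, then attend.
S = np.zeros((D_MEM, D_MEM))
for h in rng.normal(size=(512, D_MODEL)):
    S = delta_update(S, h)
attn = attention_with_memory(rng.normal(size=D_MODEL), rng.normal(size=(16, D_MODEL)), S)
```

The correction factors through the two D_MODEL×8 projections, which is why it can add at most rank-8 structure to the attention logits and why the memory's footprint stays negligible next to the frozen backbone.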

Load-bearing premise

The delta-rule-updated 8×8 state matrix can reliably extract and supply task-relevant historical information across diverse benchmarks without introducing harmful interference or requiring task-specific tuning.

What would settle it

If applying the 8×8 δ-mem state to a memory-heavy benchmark such as MemoryAgentBench produced scores no higher than the frozen backbone alone, the claim that the compact online state supplies useful memory would be falsified.

read the original abstract

Large language models increasingly need to accumulate and reuse historical information in long-term assistants and agent systems. Simply expanding the context window is costly and often fails to ensure effective context utilization. We propose $\delta$-mem, a lightweight memory mechanism that augments a frozen full-attention backbone with a compact online state of associative memory. $\delta$-mem compresses past information into a fixed-size state matrix updated by delta-rule learning, and uses its readout to generate low-rank corrections to the backbone's attention computation during generation. With only an $8\times8$ online memory state, $\delta$-mem improves the average score to $1.10\times$ that of the frozen backbone and $1.15\times$ that of the strongest non-$\delta$-mem memory baseline. It achieves larger gains on memory-heavy benchmarks, reaching $1.31\times$ on MemoryAgentBench and $1.20\times$ on LoCoMo, while largely preserving general capabilities. These results show that effective memory can be realized through a compact online state directly coupled with attention computation, without full fine-tuning, backbone replacement, or explicit context extension.
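Read literally, the abstract's "fixed-size state matrix updated by delta-rule learning" corresponds to the standard online delta rule sketched below; the per-token decay $\lambda_t$ and step size $\beta_t$ as scalars are assumptions here, though the same recursion appears in the passage quoted under the Lean theorem links further down.

```latex
% Per-token associative loss and the delta-rule update on the 8x8 state S_t
% (standard form; \lambda_t is a decay and \beta_t a step size, both scalars here).
\[
  \mathcal{L}_t(S) \;=\; \tfrac{1}{2}\,\lVert S k_t - v_t \rVert^{2},
  \qquad
  S_t \;=\; \lambda_t\, S_{t-1} \;+\; \beta_t\,(v_t - S_{t-1} k_t)\, k_t^{\top}.
\]
% The update term is a step along the negative gradient of the loss at S_{t-1}:
%   -\nabla_S \mathcal{L}_t(S_{t-1}) = (v_t - S_{t-1} k_t)\, k_t^{\top}.
```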

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes δ-mem, a lightweight online memory mechanism that augments a frozen full-attention LLM backbone with a fixed 8×8 associative state matrix updated via the delta rule; the state readout supplies low-rank corrections to the backbone attention during generation. It claims average performance gains of 1.10× over the frozen backbone and 1.15× over the strongest non-δ-mem baseline, with larger improvements (1.31× on MemoryAgentBench, 1.20× on LoCoMo) on memory-intensive tasks while largely preserving general capabilities.

Significance. If the empirical results prove robust under controlled conditions, the approach offers a practical, low-parameter route to online memory for long-term assistants and agents without full fine-tuning, backbone replacement, or context extension. The extreme compactness of the state (8×8) is a clear practical advantage.

major comments (2)
  1. [Abstract] The reported multipliers (1.10× average, 1.15× baseline, 1.31× MemoryAgentBench, 1.20× LoCoMo) are presented without any experimental protocol, baseline definitions, run counts, statistical tests, or ablation details. Because these numbers constitute the central empirical claim, their current unverifiability is a load-bearing weakness.
  2. [Method] Delta-rule update description: the 8×8 state is updated by a standard outer-product delta rule with no explicit forgetting, prioritization, or capacity control. In a never-reset online setting this risks destructive interference across tasks, yet no analysis of state evolution, eigenvalue decay, or cross-task retention is supplied to substantiate reliable extraction of task-relevant history.
minor comments (1)
  1. [Abstract] The phrase 'largely preserving general capabilities' is imprecise; report the exact scores on the general benchmarks used to support this statement.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that improve verifiability and provide the requested analyses without altering the core claims.

read point-by-point responses
  1. Referee: [Abstract] The reported multipliers (1.10× average, 1.15× baseline, 1.31× MemoryAgentBench, 1.20× LoCoMo) are presented without any experimental protocol, baseline definitions, run counts, statistical tests, or ablation details. Because these numbers constitute the central empirical claim, their current unverifiability is a load-bearing weakness.

    Authors: The abstract is a concise summary; the full experimental protocol, baseline definitions (strongest non-δ-mem baseline is the best of the compared memory methods), run counts (5 seeds), statistical tests, and ablations appear in Section 4 and Appendix B. To address the load-bearing concern we will revise the abstract to include a one-sentence reference to the evaluation setup and add standard deviations to the reported multipliers. revision: yes

  2. Referee: [Method] Delta-rule update description: the 8×8 state is updated by a standard outer-product delta rule with no explicit forgetting, prioritization, or capacity control. In a never-reset online setting this risks destructive interference across tasks, yet no analysis of state evolution, eigenvalue decay, or cross-task retention is supplied to substantiate reliable extraction of task-relevant history.

    Authors: Section 3.2 describes the standard outer-product delta-rule update on the fixed 8×8 state. The low-rank structure and chosen learning rate empirically limit interference, as reflected in the gains on memory-heavy benchmarks. We agree that explicit analysis is missing; the revised manuscript will add a subsection with state-evolution plots, eigenvalue spectra over long sequences, and cross-task retention metrics. revision: yes
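One way to make the promised analysis concrete (a hedged sketch under assumed projections and hyperparameters, not the authors' code): track the singular-value spectrum of the 8×8 state as tokens stream through the delta-rule update, since a collapsing or saturating spectrum over long sequences would be a visible signature of the interference the referee raises.

```python
import numpy as np

def track_state_spectrum(hidden_states, W_k, W_v, lam=0.99, beta=0.1, every=256):
    """Stream hidden states through the delta-rule update and record the
    singular values of the 8x8 state at regular intervals. A spectrum that
    flattens toward zero suggests stored associations are being washed out;
    a few dominant values that stop changing suggest early saturation."""
    S = np.zeros((W_k.shape[1], W_k.shape[1]))
    spectra = []
    for t, h in enumerate(hidden_states, start=1):
        k, v = h @ W_k, h @ W_v
        S = lam * S + beta * np.outer(v - S @ k, k)
        if t % every == 0:
            spectra.append((t, np.linalg.svd(S, compute_uv=False)))
    return spectra

# Illustrative usage with random projections and a 4096-token stream:
# rng = np.random.default_rng(0); W = rng.normal(size=(64, 8))
# spectra = track_state_spectrum(rng.normal(size=(4096, 64)), W, W)
```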

Circularity Check

0 steps flagged

No circularity; empirical method with independent benchmark validation

full rationale

The paper introduces δ-mem as a practical augmentation: a fixed 8×8 state matrix updated via delta-rule learning whose readout supplies low-rank attention corrections to a frozen backbone. All reported gains (1.10× average, 1.31× on MemoryAgentBench, etc.) are framed as measured outcomes on external benchmarks rather than quantities derived from the method itself. No equations appear that define a target in terms of a fitted parameter and then re-present that parameter as a prediction. No uniqueness theorem or ansatz is imported via self-citation to close the argument. The central claim therefore remains an empirical statement whose validity can be checked against held-out data without reducing to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Review performed on abstract alone; no explicit free parameters, axioms, or invented entities are stated. Delta-rule learning is a standard technique from prior literature and is not introduced here as a new entity.

pith-pipeline@v0.9.0 · 5526 in / 1189 out tokens · 86731 ms · 2026-05-13T04:01:09.077965+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • IndisputableMonolith/README.md (reality_from_one_distinction, 8-tick period) · reality_from_one_distinction · echoes

    ECHOES: this paper passage has the same mathematical shape or conceptual pattern as the Recognition theorem, but is not a direct formal dependency.

    "With only an 8×8 online memory state, δ-mem ... updated by delta-rule learning ... low-rank corrections to the backbone's attention computation"

  • IndisputableMonolith/Cost/FunctionalEquation.lean (J-cost uniqueness) · washburn_uniqueness_aczel · unclear

    UNCLEAR: Pith found a possible connection, but the relation between the paper passage and the cited Recognition theorem is ambiguous.

    "$L_t(S) = \tfrac{1}{2}\lVert S k_t - v_t \rVert^2$ ... $S_t = \lambda_t S_{t-1} + \beta_t (v_t - S_{t-1} k_t) k_t^{\top}$"

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

21 extracted references · 21 canonical work pages · 12 internal anchors

  1. [1]

    Understanding LoRA as Knowledge Memory: An Empirical Analysis

    Seungju Back, Dongwoo Lee, Naun Kang, Taehee Lee, SK Hong, Youngjune Gwon, and Sungjin Ahn. Understanding LoRA as knowledge memory: An empirical analysis. arXiv preprint arXiv:2603.01097.

  2. [2]

    Titans: Learning to Memorize at Test Time

    Ali Behrouz, Peilin Zhong, and Vahab Mirrokni. Titans: Learning to memorize at test time. arXiv preprint arXiv:2501.00663.

  3. [3]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413.

  4. [4]

    Context Length Alone Hurts LLM Performance Despite Perfect Retrieval

    Yufeng Du, Minyang Tian, Srikanth Ronanki, Subendhu Rongali, Sravan Bodapati, Aram Galstyan, Azton Wells, Roy Schwartz, Eliu A Huerta, and Hao Peng. Context length alone hurts LLM performance despite perfect retrieval. arXiv preprint arXiv:2510.05381.

  5. [5]

    A New Era of Intelligence with Gemini 3

    Google. A new era of intelligence with Gemini 3. https://blog.google/products-and-platforms/products/gemini/gemini-3/. Kelly Hong, Anton Troynikov, and Jeff Huber. Context rot: How increasing input tokens impacts LLM performance. Technical report, Chroma, July 2025. https://research.trychroma.com/context-rot. Edward J Hu, Yelong Shen, Phillip Wallis, Zeyua...

  6. [6]

    Evaluating Memory in LLM Agents via Incremental Multi-Turn Interactions

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257.

  7. [7]

    PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory

    Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. PersonaMem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688.

  8. [8]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120.

  9. [9]

    Exact Flow Linear Attention: Exact Solution from Continuous-Time Dynamics

    Jingdi Lei, Di Zhang, and Soujanya Poria. Error-free linear attention is a free lunch: Exact solution from continuous-time dynamics. arXiv preprint arXiv:2512.12602.

  10. [10]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. arXiv preprint arXiv:2402.17753.

  11. [11]

    Mass-Editing Memory in a Transformer

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022a. Kevin Meng, Arnab Sen Sharma, Alex Andonian, Yonatan Belinkov, and David Bau. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022b. OpenAI. Introducin...

  12. [12]

    LLMLingua-2: Data Distillation for Efficient and Faithful Task-Agnostic Prompt Compression

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, et al. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981.

  13. [13]

    GPQA: A Graduate-Level Google-Proof Q&A Benchmark

    David Rein, Betty Li Hou, Asa Cooper Stickland, Jackson Petty, Richard Yuanzhe Pang, Julien Dirani, Julian Michael, and Samuel R Bowman. GPQA: A graduate-level Google-proof Q&A benchmark. arXiv preprint arXiv:2311.12022.

  14. [14]

    Kimi Linear: An Expressive, Efficient Attention Architecture

    Kimi Team, Yu Zhang, Zongyu Lin, Xingcheng Yao, Jiaxi Hu, Fanqing Meng, Chengyin Liu, Xin Men, Songlin Yang, Zhiyuan Li, et al. Kimi Linear: An expressive, efficient attention architecture. arXiv preprint arXiv:2510.26692.

  15. [15]

    MIRIX: Multi-Agent Memory System for LLM-Based Agents

    Yu Wang and Xi Chen. MIRIX: Multi-agent memory system for LLM-based agents. arXiv preprint arXiv:2507.07957.

  16. [16]

    M+: Extending MemoryLLM with Scalable Long-Term Memory

    Yu Wang, Dmitry Krotov, Yuanzhe Hu, Yifan Gao, Wangchunshu Zhou, Julian McAuley, Dan Gutfreund, Rogerio Feris, and Zexue He. M+: Extending MemoryLLM with scalable long-term memory. arXiv preprint arXiv:2502.00592.

  17. [17]

    MLP Memory: A Retriever-Pretrained Memory for Large Language Models

    Rubin Wei, Jiaqi Cao, Jiarui Wang, Jushi Kai, Qipeng Guo, Bowen Zhou, and Zhouhan Lin. MLP Memory: A retriever-pretrained memory for large language models, 2026. https://arxiv.org/abs/2508.01832. Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913.

  18. [18]

    Qwen3 Technical Report

    An Yang, Anfeng Li, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chang Gao, Chengen Huang, Chenxu Lv, Chujie Zheng, Dayiheng Liu, Fan Zhou, Fei Huang, Feng Hu, Hao Ge, Haoran Wei, Huan Lin, Jialong Tang, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jing Zhou, Jingren Zhou, Junyang Lin, Kai Dang, Keqin Bao, Kexin Yang, ...

  19. [19]

    MemGen: Weaving Generative Latent Memory for Self-Evolving Agents

    Guibin Zhang, Muxin Fu, and Shuicheng Yan. MemGen: Weaving generative latent memory for self-evolving agents. arXiv preprint arXiv:2509.24704, 2025a. Kai Zhang, Xiangchao Chen, Bo Liu, Tianci Xue, Zeyi Liao, Zhihan Liu, Xiyao Wang, Yuting Ning, Zhaorun Chen, Xiaohan Fu, et al. Agent learning via early experience. arXiv preprint arXiv:2510.08558, 2025b. ...

  20. [20]

    Instruction-Following Evaluation for Large Language Models

    Jeffrey Zhou, Tianjian Lu, Swaroop Mishra, Siddhartha Brahma, Sujoy Basu, Yi Luan, Denny Zhou, and Le Hou. Instruction-following evaluation for large language models, 2023. https://arxiv.org/abs/2311.07911. Appendix A, Implementation Details, Training Setup: All models are trained for one epoch on the shortest 2,219-sample split of QASPER (Dasigi et al., 2021), whose m...

  21. [21]

    We use a peak learning rate of 2×10−4 with cosine decay and a warmup ratio of 0.1

    Training is conducted on 8×A800 GPUs with bfloat16 precision, DeepSpeed ZeRO-2 (Rasley et al., 2020), and fused AdamW. We use a peak learning rate of2×10−4 with cosine decay and a warmup ratio of 0.1. The per-device batch size is 1, with 4 gradient accumulation steps, resulting in an effective global batch size of