arxiv: 2604.08216 · v2 · submitted 2026-04-09 · 💻 cs.MA

Recognition: 2 theorem links

· Lean Theorem

MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought

Haodong Lei , Junming Liu , Yirong Chen , Ding Wang , Hongsong Wang

Authors on Pith no claims yet

Pith reviewed 2026-05-10 17:43 UTC · model grok-4.3

classification 💻 cs.MA

keywords long-context reasoningmemory mechanismschain-of-thoughtlarge language modelstest-time scalinghallucination mitigationcontext fragmentation

0 comments

The pith

MemCoT converts long-context reasoning in language models into iterative stateful search using multi-view memory and dual short-term tracking.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemCoT to fix hallucinations and catastrophic forgetting when large language models perform causal reasoning over massive fragmented contexts. It replaces static single-step retrieval with an iterative process that first locates relevant evidence and then expands surrounding causal structure. A task-conditioned dual short-term memory records past search decisions to guide further decomposition and pruning. If this holds, models can maintain coherence across long inputs without the dilution that breaks current approaches. The authors report state-of-the-art results on the LoCoMo and LongMemEval-S benchmarks for several open and closed models.

Core claim

MemCoT redefines the reasoning process by transforming long-context reasoning into an iterative, stateful information search. It introduces a multi-view long-term memory perception module that performs Zoom-In evidence localization followed by Zoom-Out contextual expansion, and it adds a task-conditioned dual short-term memory system consisting of semantic state memory and episodic trajectory memory that records historical search decisions and dynamically guides query decomposition and pruning across iterations.

What carries the argument

The multi-view long-term memory perception module for zoom-in localization and zoom-out expansion, paired with a task-conditioned dual short-term memory system of semantic state memory and episodic trajectory memory that tracks history to direct subsequent steps.

If this is right

Several open- and closed-source models achieve state-of-the-art results on the LoCoMo and LongMemEval-S benchmarks.
Reasoning shifts from passive retrieval matching to active iterative search that maintains causal structure.
Semantic dilution decreases because the model first isolates evidence then reconstructs surrounding context.
Historical search decisions stored in episodic trajectory memory allow dynamic pruning and better query decomposition over multiple steps.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same iterative memory loop could be tested on multi-document summarization tasks where context spans dozens of sources.
If the dual memory overhead proves low, deployment systems might adopt it as a lightweight wrapper around existing context windows.
The zoom-in then zoom-out pattern may generalize to visual reasoning agents that must locate then integrate evidence across image sequences.

Load-bearing premise

The proposed multi-view long-term memory and task-conditioned dual short-term memory will reliably reduce semantic dilution and contextual fragmentation without adding new failure modes or excessive overhead.

What would settle it

Apply MemCoT to models on the LoCoMo benchmark and observe whether hallucination rates and forgetting remain comparable to or higher than standard chain-of-thought baselines on the same long fragmented contexts.

Figures

Figures reproduced from arXiv: 2604.08216 by Ding Wang, Haodong Lei, Hongsong Wang, Junming Liu, Yirong Chen.

**Figure 1.** Figure 1: The answer prediction 𝐹1 scores (%) of GPT-4o-mini on the LoCoMo benchmark. MemCoT achieves a state-of-the-art overall 𝐹1 score of 58.03%, significantly outperforming all baseline memory systems across all sub-tasks. Abstract Large Language Models (LLMs) still suffer from severe hallucinations and catastrophic forgetting during causal reasoning over massive, fragmented long contexts. Existing memory mecha… view at source ↗

**Figure 2.** Figure 2: Analysis of reasoning and retrieval failures in long-context: (a) A relational diagram illustrating the dependencies [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: The overall architecture of the MemCoT framework. It illustrates the scaling test-time compute memory loop, driven [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Effects of different parameters: (a) Impact of the adjacent window size ( [PITH_FULL_IMAGE:figures/full_fig_p008_4.png] view at source ↗

**Figure 5.** Figure 5: The specific comparison of different methods. The orange mark indicates the keywords of the truth label. [PITH_FULL_IMAGE:figures/full_fig_p013_5.png] view at source ↗

read the original abstract

Large Language Models (LLMs) still suffer from severe hallucinations and catastrophic forgetting during causal reasoning over massive, fragmented long contexts. Existing memory mechanisms typically treat retrieval as a static, single-step passive matching process, leading to severe semantic dilution and contextual fragmentation. To overcome these fundamental bottlenecks, we propose MemCoT, a test-time memory scaling framework that redefines the reasoning process by transforming long-context reasoning into an iterative, stateful information search. MemCoT introduces a multi-view long-term memory perception module that enables Zoom-In evidence localization and Zoom-Out contextual expansion, allowing the model to first identify where relevant evidence resides and then reconstruct the surrounding causal structure necessary for reasoning. In addition, MemCoT employs a task-conditioned dual short-term memory system composed of semantic state memory and episodic trajectory memory. This short-term memory records historical search decisions and dynamically guides query decomposition and pruning across iterations. Empirical evaluations demonstrate that MemCoT establishes a state-of-the-art performance. Empowered by MemCoT, several open- and closed-source models achieve SOTA performance on the LoCoMo benchmark and LongMemEval-S benchmark.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

MemCoT packages multi-view long-term memory and dual short-term memories into an iterative test-time loop for long-context reasoning, but the SOTA claims on LoCoMo and LongMemEval-S rest on details that are not visible in the abstract.

read the letter

The core idea is to replace static retrieval with an iterative process: a multi-view long-term memory that zooms in on evidence and zooms out to rebuild causal context, paired with task-conditioned semantic and episodic short-term memories that track past decisions and prune queries. This is a concrete, implementable extension of existing CoT and memory work rather than a wholly new paradigm, and it directly targets semantic dilution and fragmentation in long documents. That framing is useful for anyone trying to improve reliability on tasks like multi-hop QA over large contexts without fine-tuning. The architecture description is clear enough that a reader could sketch an implementation from the abstract alone. The main weakness is that the performance claims are stated without any numbers, baselines, iteration counts, ablations, or error bars. The stress-test concern about whether gains survive limits on iterations or removal of the episodic memory is still live; nothing in the provided text shows those controls, so it remains possible that reported improvements come mainly from extra test-time steps rather than the memory mechanisms themselves. No discussion of overhead or new error modes appears either. This paper is aimed at practitioners working on long-context LLM applications who want a drop-in test-time method. It is worth sending to peer review because the problem is real and the proposed structure is specific enough to evaluate, but any referee will need the full experimental section and ablations before the SOTA statements can be taken seriously.

Referee Report

3 major / 1 minor

Summary. The paper proposes MemCoT, a test-time scaling framework for LLMs that addresses hallucinations and catastrophic forgetting in long-context causal reasoning. It introduces a multi-view long-term memory perception module enabling Zoom-In evidence localization and Zoom-Out contextual expansion, plus a task-conditioned dual short-term memory system (semantic state memory and episodic trajectory memory) that records historical decisions to guide iterative query decomposition and pruning. The central claim is that this architecture transforms static retrieval into stateful iterative search and delivers SOTA results on the LoCoMo and LongMemEval-S benchmarks across open- and closed-source models.

Significance. If the performance claims are substantiated with proper controls, MemCoT would offer a practical advance in test-time memory mechanisms for long-context reasoning, directly targeting semantic dilution and fragmentation that plague existing retrieval-augmented approaches. The iterative, multi-view design could generalize to other benchmarks requiring sustained causal structure over fragmented inputs.

major comments (3)

[Abstract and §5] Abstract and §5 (Empirical Evaluations): The SOTA claims on LoCoMo and LongMemEval-S are stated without any description of experimental setup, chosen baselines, number of iterations, error bars, statistical significance, or ablation studies. This prevents assessment of whether gains arise from the proposed multi-view long-term memory and dual short-term memory or simply from additional test-time compute.
[§4] §4 (Method, dual short-term memory): The task-conditioned semantic and episodic trajectory memories are described at a high level, but no formal specification, pseudocode, or analysis of the iterative loop is provided. In particular, there is no demonstration that performance deltas survive removal of the episodic trajectory memory or explicit caps on maximum iterations, leaving open the possibility of compounding errors in query decomposition and pruning.
[§5] §5 (Results): The central claim that MemCoT establishes SOTA requires evidence that the reported improvements are robust to controls for iteration count and error propagation in the dual-memory loop. Without ablations isolating the Zoom-In/Zoom-Out module and the episodic memory component, the benchmark results cannot be attributed to the architecture rather than unaccounted compute.

minor comments (1)

[Abstract] Abstract: Consider adding one sentence specifying the models and exact benchmark scores to make the SOTA claim more concrete for readers.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We appreciate the opportunity to clarify the experimental rigor and methodological details of MemCoT. We address each major comment below and have prepared revisions to strengthen the manuscript.

read point-by-point responses

Referee: [Abstract and §5] Abstract and §5 (Empirical Evaluations): The SOTA claims on LoCoMo and LongMemEval-S are stated without any description of experimental setup, chosen baselines, number of iterations, error bars, statistical significance, or ablation studies. This prevents assessment of whether gains arise from the proposed multi-view long-term memory and dual short-term memory or simply from additional test-time compute.

Authors: We agree that the abstract and §5 would benefit from greater transparency regarding the experimental protocol. In the revised manuscript, we will expand the abstract with a concise summary of key setup parameters and substantially augment §5 to include a full description of the experimental setup, all chosen baselines, the specific number of iterations used in the iterative search, error bars computed over multiple runs, statistical significance testing, and targeted ablation studies. These additions will allow readers to evaluate whether the reported gains derive from the multi-view long-term memory and dual short-term memory components rather than unaccounted test-time compute. revision: yes
Referee: [§4] §4 (Method, dual short-term memory): The task-conditioned semantic and episodic trajectory memories are described at a high level, but no formal specification, pseudocode, or analysis of the iterative loop is provided. In particular, there is no demonstration that performance deltas survive removal of the episodic trajectory memory or explicit caps on maximum iterations, leaving open the possibility of compounding errors in query decomposition and pruning.

Authors: We acknowledge that §4 currently presents the dual short-term memory system at a conceptual level. We will revise this section to include formal mathematical specifications for the semantic state memory and episodic trajectory memory, along with pseudocode for the complete iterative loop. We will also add new experiments that isolate the contribution of the episodic trajectory memory (performance with and without it) and that apply explicit caps on the maximum number of iterations, thereby addressing concerns about potential compounding errors in query decomposition and pruning. revision: yes
Referee: [§5] §5 (Results): The central claim that MemCoT establishes SOTA requires evidence that the reported improvements are robust to controls for iteration count and error propagation in the dual-memory loop. Without ablations isolating the Zoom-In/Zoom-Out module and the episodic memory component, the benchmark results cannot be attributed to the architecture rather than unaccounted compute.

Authors: We agree that additional controls and ablations are necessary to substantiate the attribution of gains to the proposed architecture. The revised §5 will incorporate new ablation studies that separately disable the Zoom-In/Zoom-Out long-term memory perception module and the episodic memory component. We will further include results under varying iteration counts and an analysis of error propagation within the dual-memory loop. These experiments will demonstrate the robustness of the SOTA performance to the specific design choices in MemCoT. revision: yes

Circularity Check

0 steps flagged

No derivation chain or equations present; purely empirical architecture description

full rationale

The paper introduces MemCoT as a test-time scaling framework with multi-view long-term memory and dual short-term memory components, but contains no equations, first-principles derivations, or mathematical predictions. Claims rest exclusively on empirical SOTA results on LoCoMo and LongMemEval-S benchmarks. No load-bearing step reduces to a fitted input, self-citation, or self-definition by construction. The architecture is described at the conceptual level without any reduction to its own inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities are described. The framework implicitly assumes that iterative memory search can be implemented without destabilizing the base LLM and that benchmark gains reflect genuine reasoning improvement.

pith-pipeline@v0.9.0 · 5505 in / 1166 out tokens · 25769 ms · 2026-05-10T17:43:09.290764+00:00 · methodology

discussion (0)

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean reality_from_one_distinction unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

MemCoT ... multi-view long-term memory perception module ... task-conditioned dual short-term memory system ... maximum evolutionary iterations J is uniformly set to 8
IndisputableMonolith/Foundation/AlexanderDuality.lean alexander_duality_circle_linking unclear

?

unclear
Relation between the paper passage and the cited Recognition theorem.

J=8 ... query decomposition and pruning across iterations

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

57 extracted references · 32 canonical work pages · 13 internal anchors

[1]

Akari Asai, Zeqiu Wu, Yizhong Wang, Avirup Sil, and Hannaneh Hajishirzi
[2]

InThe Twelfth International Conference on Learning Representations

Self-RAG: Learning to Retrieve, Generate, and Critique through Self- Reflection. InThe Twelfth International Conference on Learning Representations. https://openreview.net/forum?id=hSyW5go0v8
[3]

Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, Yuxiao Dong, Jie Tang, and Juanzi Li. 2024. LongBench: A Bilingual, Multitask Benchmark for Long Context Understanding. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), ...

work page doi:10.18653/v1/2024.acl-long.172 2024
[4]

Sinchana Ramakanth Bhat, Max Rudat, Jannis Spiekermann, and Nicolas Flores- Herr. 2025. Rethinking Chunk Size For Long-Document Retrieval: A Multi- Dataset Analysis. arXiv:2505.21700 [cs.IR]

work page arXiv 2025
[5]

Pu Cao, Feng Zhou, Junyi Ji, Qingye Kong, Zhixiang Lv, Mingjian Zhang, Xuekun Zhao, Siqi Wu, Yinghui Lin, Qing Song, and Lu Yang. 2025. Preliminary Explo- rations with GPT-4o(mni) Native Image Generation. arXiv:2505.05501 [cs.CV] https://arxiv.org/abs/2505.05501

work page arXiv 2025
[6]

Thomas Y Chen. 2025. How Many Parameters for Multi-Hop? An Information- Theoretic Capacity Law for Knowledge Retrieval in Large Language Models. In Knowledgeable Foundation Models at ACL 2025

2025
[7]

Xin Cheng, Wangding Zeng, Damai Dai, Qinyu Chen, Bingxuan Wang, Zhenda Xie, Kezhao Huang, Xingkai Yu, Zhewen Hao, Yukun Li, Han Zhang, Huishuai Zhang, Dongyan Zhao, and Wenfeng Liang. 2026. Conditional Memory via Scalable Lookup: A New Axis of Sparsity for Large Language Models. arXiv:2601.07372 [cs.CL]

work page arXiv 2026
[8]

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav
[9]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413 [cs.CL] https://arxiv.org/abs/2504.19413

work page internal anchor Pith review arXiv
[10]

Florin Cuconasu, Giovanni Trappolini, Federico Siciliano, Simone Filice, Cesare Campagnano, Yoelle Maarek, Nicola Tonellotto, and Fabrizio Silvestri. 2024. The Power of Noise: Redefining Retrieval for RAG Systems. InProceedings of the 47th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR 2024). ACM

2024
[11]

Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. 2025. LightRAG: Simple and Fast Retrieval-Augmented Generation. InFindings of the Association for Computational Linguistics: EMNLP 2025, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng (Eds.). Association for Computational Linguistics, 10746–10761

2025
[12]

Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su
[13]

Hipporag: Neurobiologically inspired long-term memory for large language models,

HippoRAG: Neurobiologically Inspired Long-Term Memory for Large Language Models. arXiv:2405.14831 [cs.CL]

work page arXiv
[14]

Yuyang Hu, Jiongnan Liu, Jiejun Tan, Yutao Zhu, and Zhicheng Dou. 2026. Mem- ory Matters More: Event-Centric Memory as a Logic Map for Agent Searching and Reasoning. arXiv:2601.04726 [cs.AI] https://arxiv.org/abs/2601.04726

work page arXiv 2026
[15]

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

work page internal anchor Pith review arXiv 2026
[16]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks. InAdvances in Neural Information Processing Systems, H. Larochelle, M. Ranzato, R. Hadsell, M.F...

2020
[17]

Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. 2020. Retrieval-augmented generation for knowledge-intensive NLP tasks. InProceedings of the 34th International Conference on Neural Information Processing Systems. Cu...

2020
[18]

Rui Li, Zeyu Zhang, Xiaohe Bo, Zihang Tian, Xu Chen, Quanyu Dai, Zhenhua Dong, and Ruiming Tang. 2025. CAM: A Constructivist View of Agentic Memory for LLM-Based Reading Comprehension. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/forum?id= ACSOnSHiWe

2025
[19]

Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, Boyu Chen, Shichao Song, Simin Niu, Hanyu Wang, Jiawei Yang, Chen Tang, Qingchen Yu, Jihao Zhao, Yezhaohui Wang, Peng Liu, Zehao Lin, Pengyuan Wang, Jiahao Huo, Tianyi Chen, Kai Chen, Kehang Li, Zhen Tao, Huayi Lai, Hao Wu, Bo Tang, Zhengren Wang, Zhaoxin Fan, Ningyu Zhang, Linfeng Zhang, Junchi Yan, Mingchuan ...

work page internal anchor Pith review arXiv 2025
[20]

Junming Liu, Yifei Sun, Weihua Cheng, Haodong Lei, Yirong Chen, Licheng Wen, Xuemeng Yang, Daocheng Fu, Pinlong Cai, Nianchen Deng, Yi Yu, Shuyue Hu, Botian Shi, and Ding Wang. 2025. MemVerse: Multimodal Memory for Lifelong Learning Agents. arXiv:2512.03627 [cs.AI]

work page arXiv 2025
[21]

Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2024. Lost in the Middle: How Language Models Use Long Contexts.Transactions of the Association for Computational Linguistics 12 (2024), 157–173

2024
[22]

Junru Lu, Siyu An, Mingbao Lin, Gabriele Pergola, Yulan He, Di Yin, Xing Sun, and Yunsheng Wu. 2023. MemoChat: Tuning LLMs to Use Memos for Consistent Long-Range Open-Domain Conversation. arXiv:2308.08239 [cs.CL]

work page arXiv 2023
[23]

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating Very Long-Term Conversational Memory of LLM Agents. InProceedings of the 62nd Annual Meeting of the Associ- ation for Computational Linguistics (Volume 1: Long Papers), Lun-Wei Ku, Andre Martins, and Vivek Srikumar (Eds.). Association for Com...

work page doi:10.18653/v1/2024.acl-long.747 2024
[24]

Jiayan Nan, Wenquan Ma, Wenlong Wu, and Yize Chen. 2025. Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science. arXiv:2508.03341 [cs.AI] https://arxiv.org/abs/2508.03341

work page internal anchor Pith review Pith/arXiv arXiv 2025
[25]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2024. MemGPT: Towards LLMs as Operating Systems. arXiv:2310.08560 [cs.AI]

work page internal anchor Pith review Pith/arXiv arXiv 2024
[26]

Vicky Zhao, Lili Qiu, and Jianfeng Gao

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. 2025. SeCom: On Memory Construction and Retrieval for Personalized Conversational Agents. InThe Thirteenth International Conference on Learning Representations

2025
[27]

Qwen, :, An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, Huan Lin, Jian Yang, Jianhong Tu, Jianwei Zhang, Jianxin Yang, Jiaxi Yang, Jingren Zhou, Junyang Lin, Kai Dang, Keming Lu, Keqin Bao, Kexin Yang, Le Yu, Mei Li, Mingfeng Xue, Pei Zhang, Qin Zhu, Rui Men, Runji Lin, Tianhao Li,...

work page internal anchor Pith review Pith/arXiv arXiv 2025
[28]

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. Zep: A Temporal Knowledge Graph Architecture for Agent Memory. arXiv:2501.13956 [cs.CL]

work page internal anchor Pith review arXiv 2025
[29]

Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. 2024. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. InInternational Conference on Learning Representations (ICLR)

2024
[30]

Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi GU, Xiang Wang, and An Zhang. 2026. Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents. InThe Fourteenth International Conference on Learning Representations. https://openreview.net/forum?id=1cymflI2Lh

2026
[31]

Charlie Snell, Jaehoon Lee, Kelvin Xu, and Aviral Kumar. 2024. Scaling LLM Test- Time Compute Optimally can be More Effective than Scaling Model Parameters. arXiv:2408.03314 [cs.LG] https://arxiv.org/abs/2408.03314

work page internal anchor Pith review Pith/arXiv arXiv 2024
[32]

Zihao Tang, Xin Yu, Ziyu Xiao, Zengxuan Wen, Zelin Li, Jiaxi Zhou, Hualei Wang, Haohua Wang, Haizhen Huang, Weiwei Deng, Feng Sun, and Qi Zhang
[33]

Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory

Mnemis: Dual-Route Retrieval on Hierarchical Graphs for Long-Term LLM Memory. arXiv:2602.15313 [cs.CL]

work page internal anchor Pith review Pith/arXiv arXiv
[34]

Alessandra Terranova, Björn Ross, and Alexandra Birch. 2025. Evaluating Long- Term Memory for Long-Context Question Answering. arXiv:2510.23730 [cs.CL] https://arxiv.org/abs/2510.23730

work page arXiv 2025
[35]

Xinyu Wang, Mingze Li, Peng Lu, Xiao-Wen Chang, Lifeng Shang, Jinping Li, Fei Mi, Prasanna Parthasarathi, and Yufei Cui. 2026. InfMem: Learning System- 2 Memory Control for Long-Context Agent. arXiv:2602.02704 [cs.CL] https: //arxiv.org/abs/2602.02704

work page arXiv 2026
[36]

Haoran Wei, Yaofeng Sun, and Yukun Li. 2025. DeepSeek-OCR: Contexts Optical Compression. arXiv:2510.18234 [cs.CV] https://arxiv.org/abs/2510.18234

work page internal anchor Pith review arXiv 2025
[37]

Chi, Quoc V

Jason Wei, Xuezhi Wang, Dale Schuurmans, Maarten Bosma, Brian Ichter, Fei Xia, Ed H. Chi, Quoc V. Le, and Denny Zhou. 2022. Chain-of-thought prompting elicits reasoning in large language models. InProceedings of the 36th International Conference on Neural Information Processing Systems(New Orleans, LA, USA) (NIPS ’22). Curran Associates Inc., Red Hook, NY...

2022
[38]

Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

Tianxin Wei, Noveen Sachdeva, Benjamin Coleman, Zhankui He, Yuanchen Bei, Xuying Ning, Mengting Ai, Yunzhe Li, Jingrui He, Ed H. Chi, Chi Wang, Shuo Chen, Fernando Pereira, Wang-Cheng Kang, and Derek Zhiyuan Cheng. 2025. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. arXiv:2511.20857 [cs.CL]

work page internal anchor Pith review arXiv 2025
[39]

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu
[40]

InThe Thirteenth International Conference on Learning Representations

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. InThe Thirteenth International Conference on Learning Representations
[41]

Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, wenlin zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, and Tong Xu. 2025. From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents. arXiv:2505.19549 [cs.CL]

work page arXiv 2025
[42]

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang
[43]

InThe Thirty-ninth Annual Conference on Neural Information Processing Systems

A-Mem: Agentic Memory for LLM Agents. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems. https://openreview.net/ forum?id=FiM0M8gcct
[44]

Xiucheng Xu, Bingbing Xu, Xueyun Tian, Zihe Huang, Rongxin Chen, Yunfan Li, and Huawei Shen. 2026. Chain-of-Memory: Lightweight Memory Construction with Dynamic Evolution for LLM Agents. arXiv:2601.14287 [cs.LG]

work page internal anchor Pith review Pith/arXiv arXiv 2026
[45]

Shunyu Yao, Dian Yu, Jeffrey Zhao, Izhak Shafran, Tom Griffiths, Yuan Cao, and Karthik Narasimhan. 2023. Tree of Thoughts: Deliberate Problem Solving with Large Language Models. InAdvances in Neural Information Processing Systems, A. Oh, T. Naumann, A. Globerson, K. Saenko, M. Hardt, and S. Levine (Eds.), Vol. 36. Curran Associates, Inc., 11809–11822

2023
[46]

Shunyu Yao, Jeffrey Zhao, Dian Yu, Nan Du, Izhak Shafran, Karthik Narasimhan, and Yuan Cao. 2023. ReAct: Synergizing Reasoning and Acting in Language Models. InInternational Conference on Learning Representations (ICLR)

2023
[47]

Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, and Hao Zhou. 2026. MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent. InThe Fourteenth International Conference on Learning Representations

2026
[48]

Xinlei Yu, Chengming Xu, Guibin Zhang, Zhangquan Chen, Yudong Zhang, Yongbo He, Peng-Tao Jiang, Jiangning Zhang, Xiaobin Hu, and Shuicheng Yan
[49]

Vismem: Latent vision memory unlocks potential of vision-language models.arXiv preprint arXiv:2511.11007, 2025

VisMem: Latent Vision Memory Unlocks Potential of Vision-Language Models. arXiv:2511.11007 [cs.CV]

work page arXiv
[50]

Guibin Zhang, Muxin Fu, Guancheng Wan, Miao Yu, Kun Wang, and Shuicheng Yan. 2025. G-Memory: Tracing Hierarchical Memory for Multi-Agent Systems. arXiv preprint arXiv:2506.07398(2025)

work page arXiv 2025
[51]

Guibin Zhang, Muxin Fu, and Shuicheng Yan. 2025. MemGen: Weaving Genera- tive Latent Memory for Self-Evolving Agents. arXiv:2509.24704 [cs.CL]

work page arXiv 2025
[52]

Guibin Zhang, Haotian Ren, Chong Zhan, Zhenhong Zhou, Junhao Wang, He Zhu, Wangchunshu Zhou, and Shuicheng Yan. 2025. MemEvolve: Meta-Evolution of Agent Memory Systems. arXiv:2512.18746 [cs.CL]

work page arXiv 2025
[53]

Haozhen Zhang, Quanyu Long, Jianzhu Bao, Tao Feng, Weizhi Zhang, Haodong Yue, and Wenya Wang. 2026. MemSkill: Learning and Evolving Memory Skills for Self-Evolving Agents. arXiv:2602.02474 [cs.CL]

work page internal anchor Pith review arXiv 2026
[54]

Yusen Zhang, Ruoxi Sun, Yanfei Chen, Tomas Pfister, Rui Zhang, and Sercan Ö. Arı k. 2024. Chain of Agents: Large Language Models Collaborating on Long- Context Tasks. InAdvances in Neural Information Processing Systems, A. Glober- son, L. Mackey, D. Belgrave, A. Fan, U. Paquet, J. Tomczak, and C. Zhang (Eds.), Vol. 37. Curran Associates, Inc., 132208–1322...

work page doi:10.52202/079017-4202 2024
[55]

Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. 2025. A survey on the memory mechanism of large language model-based agents.ACM Transactions on Information Systems 43, 6 (2025), 1–47

2025
[56]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. 2024. MemoryBank: enhancing large language models with long-term memory. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence and Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence and Fourteenth Symposium on Educational Advances in Artifici...

work page doi:10.1609/aaai 2024
[57]

A simple yet strong baseline for long-term conversational memory of LLM agents, 2025

Sizhe Zhou and Jiawei Han. 2025. A Simple Yet Strong Baseline for Long-Term Conversational Memory of LLM Agents. arXiv:2511.17208 [cs.CL] https://arxiv. org/abs/2511.17208 10 MemCoT: Test-Time Scaling through Memory-Driven Chain-of-Thought MM ’26, November 10–14, 2026, Rio de Janeiro, Brazil A Unified Objective in MemCoT This appendix formalizes how the u...

work page arXiv 2025