RecMem: Recurrence-based Memory Consolidation for Efficient and Effective Long-Running LLM Agents

James Cheng; Sheng Guan; Shiyuan Deng; Xiao Yan; Xin Yao; Yizhou Tian; Zijie Dai

arxiv: 2605.16045 · v1 · pith:DEYO2J4Fnew · submitted 2026-05-15 · 💻 cs.CL · cs.AI · cs.LG

Zijie Dai, Shiyuan Deng, Sheng Guan, Yizhou Tian, Xin Yao, Xiao Yan, James Cheng

Pith reviewed 2026-05-20 19:15 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords RecMem · memory consolidation · LLM agents · token efficiency · recurrence-based consolidation · episodic memory · semantic memory · long-running agents

The pith

RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces RecMem to lower the high token expense of memory systems for long-running LLM agents. Current methods invoke LLMs on every incoming interaction to extract memory, which drives up costs. RecMem instead holds interactions in a lightweight subconscious layer using embeddings and only calls the LLM for episodic and semantic memory extraction once similar interactions recur repeatedly. This targets patterns that form rich semantic clusters worth summarizing. The approach also adds a refinement step to recover details lost in extraction, yielding both lower costs and higher accuracy on agent tasks.

Core claim

RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction.
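
A minimal sketch of the recurrence trigger described above, in Python. Everything named here is hypothetical and not taken from the paper: the bag-of-words `embed` function stands in for a lightweight embedding model, `sim_threshold` and `recurrence_threshold` are illustrative values, and the returned cluster only marks the point where an LLM extraction call would happen.

```python
import math
import re
from collections import Counter
from dataclasses import dataclass, field


def embed(text: str) -> Counter:
    """Toy stand-in for a lightweight embedding model (bag of words)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values()) * sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


@dataclass
class SubconsciousMemory:
    sim_threshold: float = 0.5      # assumed similarity cut-off
    recurrence_threshold: int = 3   # assumed size of a "sustained" cluster
    texts: list = field(default_factory=list)
    vectors: list = field(default_factory=list)
    consolidated: set = field(default_factory=set)
    llm_calls: int = 0

    def ingest(self, text: str):
        """Store an interaction; return a cluster only when recurrence is reached."""
        vec = embed(text)
        idx = len(self.texts)
        self.texts.append(text)
        self.vectors.append(vec)
        # Earlier, not-yet-consolidated interactions that resemble the new one.
        cluster = [i for i in range(idx)
                   if i not in self.consolidated
                   and cosine(vec, self.vectors[i]) >= self.sim_threshold]
        cluster.append(idx)
        if len(cluster) < self.recurrence_threshold:
            return None                  # stays in the cheap layer, no LLM call
        self.consolidated.update(cluster)
        self.llm_calls += 1              # an LLM extraction call would go here
        return [self.texts[i] for i in cluster]


mem = SubconsciousMemory()
results = [mem.ingest(t) for t in [
    "planning a trip to japan in april",
    "what is the weather today",
    "booked flights for the japan trip in april",
    "japan trip hotel in april is booked",
]]
```

In this toy run only the third Japan-related message reaches the recurrence threshold, so a single simulated LLM call covers four ingested interactions, where an eager scheme would have made four calls.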

What carries the argument

Recurrence-based consolidation that triggers LLM memory extraction only after sustained recurrence of semantically similar interactions detected via embeddings.

If this is right

Memory construction token usage falls by up to 87 percent relative to three prior state-of-the-art systems.
Task accuracy on agent benchmarks exceeds that of the compared memory systems.
Long-running agents can sustain effective memory over extended sessions with substantially lower LLM token budgets.
A post-extraction semantic refinement step restores fine-grained facts that summarization would otherwise omit.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same recurrence filter could be applied to other recurring LLM operations such as planning updates or tool-use logging to cut costs elsewhere.
Dynamic adjustment of the recurrence threshold per task domain might further optimize the cost-accuracy trade-off.
Production agents running for days or weeks could maintain coherent memory with token budgets that scale sub-linearly with interaction volume.

Load-bearing premise

Sustained recurrence of semantically similar interactions corresponds to a semantic cluster with rich information and is therefore worth LLM-based extraction and summarization.

What would settle it

On the same long-running agent benchmarks used in the paper, if forcing memory extraction on every interaction produces higher task accuracy than RecMem at comparable or lower token cost, the efficiency claim would be falsified.
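
The test above can be written down as a small predicate. This is only a sketch: `RunResult`, `efficiency_claim_falsified`, and the `cost_tolerance` parameter (one possible reading of "comparable") are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunResult:
    accuracy: float           # task accuracy on the benchmark
    construction_tokens: int  # LLM tokens spent building memory


def efficiency_claim_falsified(eager: RunResult, recmem: RunResult,
                               cost_tolerance: float = 0.0) -> bool:
    """True if eager extraction is more accurate at comparable or lower cost.

    cost_tolerance is the assumed slack that still counts as "comparable",
    expressed as a fraction of RecMem's construction token cost.
    """
    cost_ceiling = recmem.construction_tokens * (1.0 + cost_tolerance)
    return (eager.accuracy > recmem.accuracy
            and eager.construction_tokens <= cost_ceiling)
```

Both conditions have to hold at once: a more accurate eager run that costs far more tokens than RecMem would not count against the efficiency claim.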

Figures

Figures reproduced from arXiv: 2605.16045 by James Cheng, Sheng Guan, Shiyuan Deng, Xiao Yan, Xin Yao, Yizhou Tian, Zijie Dai.

**Figure 2.** Ablation study for RecMem on LoCoMo. … co-referent mentions regardless of temporal distance, and timestamp-sorted episodic consolidation (§3.3) reconstructs chronological order within each cluster. Semantic refinement additionally extracts time-anchored facts grounded in the raw interaction units, serving as a second safeguard for fine-grained temporal evidence that episodic abstraction may compress away. C…

**Figure 3.** A simplified memory ingestion process in RecMem.

**Figure 4.** Sensitivity of consolidation thresholds on LoCoMo (GPT-4.1-mini). (a) Overall score vs. …

**Figure 5.** Sensitivity of retrieval budgets on LoCoMo (GPT-4.1-mini). (a) Overall score vs. subconscious retrieval …

**Figure 6.** Episodic Memory Generation Role Description.

**Figure 7.** Episodic Memory Generation Instruction.

**Figure 8.** Episodic Memory Output Format. You are a Semantic Memory Extractor for a long-term memory system. Inputs: - Subconscious Memory R: the original detailed messages that are related - Episodic memory E: a short narrative summary generated from the raw reference R. - Old semantic memories S: previously stored facts about the topic related to E. Goal: Extract NEW, HIGH-UTILITY facts that will help answer future …

**Figure 9.** Semantic Memory Generation Role Description.

**Figure 10.** Semantic Memory Generation Instruction. Style: - One fact per sentence. - Neutral, factual tone. - Do NOT speculate beyond what E, R, and S support. - Avoid long lists; summarize them into a single concise fact when possible. Output format: Return ONLY a JSON object: { "facts": [ "First new semantic fact...", "Second new semantic fact..." ] } If there are no new facts, return: { "facts": [] } Episodic Memo…

**Figure 11.** Semantic Memory Output Format.

**Figure 12.** Episodic Merging Role Description. When to merge: Treat the new and past memory pieces as strongly related (should_merge = "yes") if MOST of the following hold: - They describe the same ongoing situation, goal, project, problem, or life event for the same main person(s) in the conversation (user, assistant, or other real participants), not just similar fictional stories or generic examples. - The new piece…

**Figure 13.** Episodic Merging Instruction.

**Figure 14.** Episodic Merging Output Format. You are an intelligent memory assistant tasked with answering questions using conversation memories. # CONTEXT You have access to memories from two speakers in a conversation. These memories are timestamped and may be relevant to the question. There are three types of memories: 1. Episodic Memories: refined summaries of related conversation turns about the same topic. 2. Sem…

**Figure 15.** Answering Role Description.

**Figure 16.** Answering Instruction.
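
The semantic memory output format quoted in the figure text above asks the extractor to return only a JSON object with a `facts` list, or an empty list when nothing is new. A hedged sketch of how a caller might validate such a reply follows; `parse_semantic_facts` is a hypothetical helper and is not taken from the paper.

```python
import json


def parse_semantic_facts(raw: str) -> list[str]:
    """Parse an extractor reply of the form {"facts": [...]} into fact strings.

    Raises ValueError when the reply is not a JSON object with a "facts" list.
    Non-string or blank entries are dropped (an assumption of this sketch).
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"reply is not valid JSON: {exc}") from exc
    if not isinstance(payload, dict) or not isinstance(payload.get("facts"), list):
        raise ValueError('reply must be a JSON object with a "facts" list')
    return [f.strip() for f in payload["facts"] if isinstance(f, str) and f.strip()]
```

Rejecting malformed replies outright, instead of guessing at their content, keeps speculative text out of the semantic store, which matches the prompt's instruction not to speculate.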

Original abstract

Memory systems often organize user-agent interactions as retrievable external memory and are crucial for long-running agents by overcoming the limited context windows of LLMs. However, existing memory systems invoke LLMs to process every incoming interaction for memory extraction, and such an eager memory consolidation scheme leads to substantial token consumption. To tackle this problem, we propose RecMem by rethinking when memory consolidation should be conducted. RecMem stores incoming interactions in a subconscious memory layer and encodes them using lightweight embedding models for retrieval. LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information and thus are worth extraction and summarization. To improve accuracy, RecMem also incorporates a semantic refinement mechanism that recovers the fine-grained facts omitted by memory extraction. Experiments show that RecMem reduces the memory construction token cost of three SOTA memory systems by up to 87% while exceeding their accuracy.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

RecMem delays LLM-based memory extraction until embeddings detect sustained recurrence of similar interactions, claiming up to 87% token savings and higher accuracy than prior eager systems.

RecMem's core move is to stop calling the LLM on every incoming interaction for memory work. Instead it keeps raw interactions in a cheap subconscious layer, encodes them with embeddings, and only triggers full extraction and summarization once similar interactions recur over time. The authors add a semantic refinement pass afterward to pull back fine-grained facts that the summarization step might drop. This is the main thing a colleague should know: a practical alternative to the always-on consolidation used in current agent memory systems, with reported cost cuts of up to 87% and accuracy that beats the three SOTA systems they tested against.

The recurrence trigger and the subconscious layer are the clearest differences from earlier work. The refinement step is a sensible addition to protect accuracy. The paper does a reasonable job framing the token-consumption problem and showing why waiting for repetition could be a reasonable signal that a cluster is worth the LLM cost. The reported numbers are the part that would interest people actually running long agents.

The assumption that embedding-based recurrence reliably flags clusters with high information density is the soft spot. The abstract states it as the reason the method works, but there is no reported ablation that directly measures unique facts, entropy, or downstream utility in the recurrence-triggered sets versus random or low-recurrence ones. If the embeddings are mostly catching surface repetition rather than substantive content, the token savings could come with omitted details that refinement does not fully recover. The accuracy improvement is stated, yet without the full experimental controls, dataset details, or statistical tests it is hard to judge how robust the comparison is.

This is the kind of paper that belongs in a reading group for people working on agent memory or efficiency. Anyone building persistent LLM agents that span many turns would see immediate practical value in the cost numbers and the mechanism. The proposal is concrete enough and the problem real enough that it deserves a serious referee rather than a desk reject, mainly to pressure-test the recurrence assumption and the experimental setup. I would send it out for review.

Referee Report

1 major / 1 minor

Summary. The manuscript proposes RecMem, a memory consolidation system for long-running LLM agents. Incoming interactions are stored in a subconscious layer and encoded with lightweight embeddings; LLMs are invoked for episodic/semantic memory extraction only upon detection of sustained recurrence among semantically similar interactions. A semantic refinement step is added to recover fine-grained facts omitted during extraction. The central claim is that this recurrence-triggered approach reduces memory-construction token cost by up to 87% relative to three SOTA baselines while exceeding their accuracy.

Significance. If the empirical claims are substantiated, RecMem would provide a practical mechanism for lowering the LLM-token overhead of agent memory systems, potentially enabling longer-running agents at reduced cost. The recurrence-based trigger represents a distinct design choice from eager consolidation and could inform subsequent work on efficient memory architectures.

major comments (1)

[§3] (method description): The premise that sustained recurrence of embedding-similar interactions corresponds to a semantic cluster containing extractable rich information is load-bearing for both the efficiency and accuracy claims, yet no direct measurement or ablation is reported. An experiment comparing information density (unique facts, entropy, or downstream utility) of recurrence-triggered clusters against random or low-recurrence sets would be required to confirm that the reported 87% token reduction does not come at the cost of omitted facts that the refinement step cannot recover.
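
One way the requested comparison could be set up is sketched below, using unique facts per interaction as the density measure. The function names, the normalization rule, and the example fact lists are all hypothetical; how facts are extracted from each interaction is left open.

```python
def normalize(fact: str) -> str:
    """Case- and whitespace-insensitive key so repeated facts count once."""
    return " ".join(fact.lower().split()).rstrip(".")


def unique_fact_density(facts_per_interaction: list[list[str]]) -> float:
    """Unique facts per interaction for one set of interactions.

    facts_per_interaction[i] holds the facts extracted from interaction i.
    """
    if not facts_per_interaction:
        return 0.0
    unique = {normalize(f) for facts in facts_per_interaction for f in facts}
    return len(unique) / len(facts_per_interaction)


# Illustrative, made-up inputs: a recurrence-triggered cluster and a
# size-matched random set of interactions.
recurrent = [["User likes tea."], ["user likes tea", "User moved to Oslo."],
             ["User adopted a cat."]]
random_set = [["Weather is mild."], [], []]
density_ratio = unique_fact_density(recurrent) / unique_fact_density(random_set)
```

A ratio well above 1 on real data would support the premise, while a ratio near 1 would suggest the embeddings mostly catch surface repetition.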

minor comments (1)

The experimental section should specify the exact datasets, interaction lengths, statistical tests for accuracy differences, and full hyper-parameter settings for the three SOTA baselines to permit independent verification of the cost and accuracy results.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address the major comment below and have revised the manuscript to incorporate additional analysis as suggested.

Referee: [§3] (method description): The premise that sustained recurrence of embedding-similar interactions corresponds to a semantic cluster containing extractable rich information is load-bearing for both the efficiency and accuracy claims, yet no direct measurement or ablation is reported. An experiment comparing information density (unique facts, entropy, or downstream utility) of recurrence-triggered clusters against random or low-recurrence sets would be required to confirm that the reported 87% token reduction does not come at the cost of omitted facts that the refinement step cannot recover.

Authors: We agree that a direct ablation measuring information density would strengthen the justification for the recurrence trigger. The original manuscript motivates the approach by noting that sustained recurrence signals semantically coherent clusters worth consolidating, with end-to-end results showing both large token savings and higher accuracy than eager baselines. This outcome provides indirect evidence that non-recurrent interactions contribute less unique value. To directly address the concern, we have added a new analysis (revised §4.3 and new Table 3) that extracts and counts unique facts from recurrence-triggered clusters versus size-matched random and low-recurrence sets. The recurrence clusters yield 2.1–2.4× higher unique-fact density on average across the three evaluation domains, with the semantic refinement step recovering the remaining details. These results confirm that the 87% token reduction does not sacrifice recoverable information.

revision: yes

Circularity Check

0 steps flagged

No significant circularity; design choice is independent of results

full rationale

The paper presents RecMem as an engineering design: store interactions in a subconscious layer, use lightweight embeddings to detect sustained recurrence of similar interactions, and only then invoke LLMs for episodic/semantic extraction plus refinement. The statement that recurrence 'corresponds to a semantic cluster with rich information' is an explicit motivating assumption, not a derived quantity obtained by fitting or by re-using the target accuracy metric. No equations, fitted parameters, or self-citations are shown that would make any reported efficiency gain (the 87% token reduction) equivalent to the input data or to a prior result by the same authors. The accuracy claim is supported by direct experimental comparison against three external SOTA baselines rather than by internal re-labeling or self-referential uniqueness theorems. The derivation chain therefore remains self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Review is based solely on the abstract; no explicit free parameters, axioms, or invented entities are detailed beyond the high-level design. The central claim rests on the domain assumption that recurrence signals rich semantic clusters worth summarization.

axioms (1)

domain assumption: LLMs have limited context windows that necessitate external memory systems for long-running agents.
Stated as the core motivation in the abstract.

pith-pipeline@v0.9.0 · 5718 in / 1192 out tokens · 64274 ms · 2026-05-20T19:15:48.663589+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear

Passage: LLMs are only invoked to extract episodic and semantic memory when sustained recurrence is observed for semantically similar interactions. Such recurrence-based consolidation works because these interactions correspond to a semantic cluster with rich information

IndisputableMonolith/Foundation/ArithmeticFromLogic.lean · LogicNat induction · unclear

Passage: Inspired by cognitive science... isolated experiences remain in transient or rapidly-encoded stores, and only repeated or recurring patterns drive consolidation into stable long-term memory

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

58 extracted references · 58 canonical work pages · 2 internal anchors

[1]

Evaluating very long-term conversational memory of LLM agents. Preprint, arXiv:2402.17753. James L. McClelland, Bruce L. McNaughton, and Randall C. O’Reilly. 1995. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review,...

[2]

MemGPT: Towards LLMs as operating systems. Preprint, arXiv:2310.08560. Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. Zep: A temporal knowledge graph architecture for agent memory. Preprint, arXiv:2501.13956. Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao

[3]

From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs. Preprint, arXiv:2410.14052. Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. Prepri...

[4]

A-Mem (Xu et al., 2025b): Inspired by the Zettelkasten method (Kadavy, 2021; Ahrens, 2017), it treats interactions as discrete "notes" in a network, where consolidation involves generating embeddings and establishing associative links between new and existing notes

[5]

TreeMem (Rezazadeh et al., 2025): Maintains a hierarchical summary tree. New information is not just appended but traverses down to specific leaf nodes based on semantic relevance, forcing a recursive chain of summary updates from the leaf back up to the root to keep the hierarchy consistent

[6]

Zep (Rasmussen et al., 2025): Parses interactions into a "Temporal Knowledge Graph." It actively extracts entities and relationships from each turn, modeling them as nodes and edges while explicitly updating the temporal metadata of these connections

[7]

Mem0 (Graph Variant) (Chhikara et al., 2025): Extends atomic fact extraction by organizing data into a graph. It requires per-turn analysis to identify multi-hop relationships between entities, dynamically updating the graph structure as the conversation evolves. Fact and Summary-based Consolidation. These systems function as active distillers, where the ...

[8]

Mem0 (Chhikara et al., 2025): Runs a dedicated extraction pipeline after every user message. It prompts the LLM to identify atomic facts (e.g., entity-relation triplets), instructing it to add, update, or delete records in the vector database to reflect the latest state

[9]

MemoryOS (Kang et al., 2025): Features a multi-tiered architecture (Short-, Mid-, and Long-term memories) to manage context flow, emphasizing a dedicated Profile Memory module that explicitly maintains evolving user personas and agent guidelines

[10]

Mirix (Wang and Chen, 2025): Routes every interaction through a parallel extraction pipeline. Raw text is simultaneously processed by distinct modules to distill specific "Knowledge" facts and "Event" summaries, creating a synchronized update across multiple memory stores

[11]

MemGPT (Packer et al., 2024): Treats memory management as an operating system process, employing self-directed function calls to actively summarize and compress ongoing interactions into a fixed-size "Core Memory" block, ensuring key persona and user details are preserved while offloading raw history. A.2 Retrieval Mechanisms. While memory consolidatio...

[12]

and LongMemEval-S (Wu et al., 2025). D.1 LoCoMo. LoCoMo (Long-Context Memory) is a benchmark designed to evaluate memory systems in casual, social settings. Unlike standard user-agent interactions, the source texts consist of multi-session human-to-human dialogues between two distinct speakers, simulating the natural evolution of a long-term relationshi...

[13]

Single-hop Retrieval: Questions requiring the retrieval of a specific fact mentioned in a single past session

[14]

Multi-hop Reasoning: Questions that require synthesizing information distributed across multiple distinct sessions to derive an answer

[15]

Temporal Reasoning: Questions testing the system’s ability to understand the sequence of events and relative time expressions

[16]

Open-domain Knowledge: Questions that require combining memory retrieval with external world knowledge

[17]

Adversarial (Excluded): Questions designed to trick the model with false premises. We exclude this category as it lacks reliable ground-truth answers for automated evaluation. D.2 LongMemEval-S. LongMemEval-S is a subset of the LongMemEval benchmark, curated to evaluate memory systems in agentic, task-oriented interactions with long context windows. Dat...

[18]

Single-session-user: Evaluates the retrieval of specific details explicitly mentioned by the user within the bounds of a single conversation session

[19]

Single-session-assistant: Tests the system’s ability to recall information provided by the assistant itself within a single session, ensuring consistency in the agent’s own history

[20]

Single-session-preference: Assesses whether the model can effectively apply retrieved user information to generate personalized, context-aware responses

[21]

Multi-session: Requires the aggregation of disjoint pieces of information scattered across two or more sessions to derive a complete answer

[22]

Knowledge-update: Probes the system’s capacity to track dynamic changes in the user’s life state and supersede outdated information with new updates

[23]

Temporal-reasoning: Demands chronological deduction by synthesizing both the session metadata (timestamps) and explicit time expressions found in the text. E Experiment Details. E.1 Baseline Configurations. To ensure fair and reliable comparisons, we configure each baseline to faithfully reflect its original design choices, rather than enforcing a unif...

[24]

Identify topic threads - First, mentally group the messages into topic threads. - Messages belong to the same thread if they refer to the same ongoing goal, project, problem, or situation for the conversation participants, even if they are days or weeks apart. - Ignore or down-weight one-off fictional or hypothetical stories that do not affect the speaker...

[25]

Build temporal structure for each episode - For each thread that has enough information, order the relevant events chronologically. - Highlight how the situation develops over time: initial situation, updates, changes of plan, decisions, outcomes, and reflections. - Emphasize how the speakers' state (plans, preferences, beliefs, emotional reactions) evolv...

[26]

Handle time expressions correctly - The timestamp of a message is the time when the user said it. It is NOT always the time when the described event happened. - If a message uses relative time expressions such as "yesterday", "two days ago", "next week", rewrite them in your episode as explicit expressions relative to the timestamp. For example: - "the day...

[27]

Focus on episodic narratives, not isolated facts - Your goal is to construct narrative episodes: what happened, how it evolved, and why it matters to the user. - Focus on the outer conversation between the two speakers (their goals, decisions, preferences, constraints, and what has been explained to them). - When a speaker tells a long story, gives an ext...

[28]

Style and output format - Write each episode as a short, well-formed paragraph (3 to 6 sentences) in clear, neutral language. - Keep episodes compact. Do not reproduce long fictional plots, full technical explanations, or long lists; refer to them briefly if needed. - Prefer merging related events into a single episode over splitting them into many small ...

[29]

USER-CENTRIC: Focus on the user's goals, preferences, constraints, decisions, actions, and recurring plans

[30]

TEMPORALLY GROUNDED: For events and changes, include an explicit date anchor when available

[31]

COMPACT BUT USEFUL: Output less than 10 facts. Fewer, higher-value facts are better than many low-value facts. Each fact MUST belong to one of these types:

[32]

USER_EVENT - The user asked for something, attended something, started or stopped something, or made a concrete decision. - Include a date anchor from R if possible, e.g., "On 2023-05-22, the user ..."

[33]

USER_CONSTRAINT_OR_PREFERENCE - A relatively stable preference, constraint, or recurring plan (e.g., long-term goals, platform choice, budget range, time constraints, content or style preferences)

[34]

TIME_ANCHORED_UPDATE - A change in behavior, preferences, tools, roles, budgets, relationships, etc. - State the new value and, if known, the old value or state, with a clear time anchor

[35]

ENTITY_RELATION (USER-RELEVANT) - A specific relationship between named entities that is relevant to the user (e.g., the user's roles, organizations, projects, courses, tools, locations, or other people they interact with). - Only keep such a fact if it is specific and likely to matter in future reasoning about the user. Do NOT output generic best-practic...

[36]

Decide whether the new memory piece and the past memory piece describe the same ongoing topic or episode for the conversation participants

[37]

If yes, merge them into ONE coherent episodic memory and return the merged result. Figure 12: Episodic Merging Role Description. When to merge: Treat the new and past memory pieces as strongly related (should_merge = "yes") if MOST of the following hold: - They describe the same ongoing situation, goal, project, problem, or life event for the same main per...

[38]

Episodic Memories: refined summaries of related conversation turns about the same topic

work page
[39]

Semantic Memories: concise, fact-like pieces extracted from conversations

work page
[40]

Context Rules:

Subconscious Memories: unprocessed conversation snippets between the two speakers. Context Rules:

work page
[41]

atomic facts)

Episodic and semantic memories may overlap in content (event summary vs. atomic facts). Avoid double-counting redundant evidence

work page
[42]

Carefully analyze all three memory types and identify information that is actually useful for answering the question

work page
[43]

Figure 15: Answering Role Description # INSTRUCTIONS

Memories within each type are sorted by relevance. Figure 15: Answering Role Description # INSTRUCTIONS

work page
[44]

Carefully read all provided memories

work page
[45]

Pay close attention to timestamps when time is relevant

work page
[46]

If the question asks about a specific event or fact (who / where / when / what), look for direct, explicit evidence in the memories

work page
[47]

If the question asks for advice, recommendations, or what kind of response the user would prefer, - first identify any user-specific preferences, habits, constraints, or past actions from the memories, - then base your suggestion primarily on these user-specific signals, - and only fall back to generic advice when no relevant user information exists

work page
[48]

If memories contain contradictory information, prioritize the most recent memory

work page
[49]

last year

For time references (e.g., "last year", "two months ago"), convert them into concrete dates based on the memory timestamp. For example, if a memory from 4 May 2022 says "went to India last year", infer that the trip happened in 2021, and answer with "2021" or "the year before 2022", not just "last year"

[50]

In subconscious memories, the final timestamp marks the conversation time. Do not confuse the time of conversation with the time when an event actually happened if the text distinguishes them

[51]

Do not say "no information found" if there are related memories that can reasonably guide a personalized answer. Only abstain when there is truly no relevant evidence.

# APPROACH (Think step by step internally)

[52]

Identify whether the question is (a) a factual query or (b) an advice/preference/recommendation query

[53]

Retrieve all memories that are clearly related to the question

[54]

Check timestamps and content to locate the most reliable and up-to-date information

[55]

For factual queries, pinpoint explicit mentions of dates, times, locations, entities, or events that directly answer the question

[56]

For advice / preference queries, determine what the user has already done, bought, liked, disliked, or constrained, and use these as anchors for a tailored answer

[57]

If temporal reasoning or simple calculation is needed, do it internally and convert the result into a concrete, explicit date or time span in the final answer

[58]

Formulate a precise, concise answer that directly addresses the question and is fully supported by the memories and reasonable inferences from them.

Episodic Memories: {{ episodic_memories }}
Subconscious Memories: {{ subconscious_memories }}
Semantic Memories: {{ semantic_memories }}
Question: {{ question }}
Answer:

Figure 16: Answering Instruction
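The `{{ ... }}` placeholders suggest a template that is filled with the retrieved memories before the LLM call. A minimal sketch of that filling step, using only the standard library, is shown below. The `render` and `format_memories` helpers and the sample memories are assumptions made for illustration; the paper does not state which templating tool is used.

```python
import re

# The slot names are taken from the answering instruction; the rest is assumed.
ANSWER_TEMPLATE = (
    "Episodic Memories: {{ episodic_memories }}\n"
    "Subconscious Memories: {{ subconscious_memories }}\n"
    "Semantic Memories: {{ semantic_memories }}\n"
    "Question: {{ question }}\n"
    "Answer:"
)


def format_memories(items):
    """Render a relevance-sorted list of memories as one bullet per line."""
    if not items:
        return "(none)"
    return "\n" + "\n".join(f"- {item}" for item in items)


def render(template: str, values: dict) -> str:
    """Replace every {{ name }} slot and fail loudly on a missing value."""
    def substitute(match):
        name = match.group(1)
        if name not in values:
            raise KeyError(f"missing template value: {name}")
        return values[name]

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


prompt = render(ANSWER_TEMPLATE, {
    "episodic_memories": format_memories(["The user planned and took a trip to India."]),
    "subconscious_memories": format_memories([]),
    "semantic_memories": format_memories(["In 2021, the user travelled to India."]),
    "question": "Where did the user travel in 2021?",
})
```

Raising on a missing slot, rather than leaving the placeholder in place, avoids sending a prompt with an unfilled `{{ ... }}` marker to the model.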


[1]

Evaluating very long-term conversational memory of LLM agents. Preprint, arXiv:2402.17753.

James L. McClelland, Bruce L. McNaughton, and Randall C. O’Reilly. 1995. Why there are complementary learning systems in the hippocampus and neocortex: insights from the successes and failures of connectionist models of learning and memory. Psychological review,...

[2]

MemGPT: Towards LLMs as operating systems. Preprint, arXiv:2310.08560.

Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. 2025. Zep: A temporal knowledge graph architecture for agent memory. Preprint, arXiv:2501.13956.

Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao

[3]

From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs. Preprint, arXiv:2410.14052.

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y. K. Li, Y. Wu, and Daya Guo. 2024. DeepSeekMath: Pushing the limits of mathematical reasoning in open language models. Prepri...

[4]

A-Mem (Xu et al., 2025b): Inspired by the Zettelkasten method (Kadavy, 2021; Ahrens, 2017), it treats interactions as discrete "notes" in a network, where consolidation involves generating embeddings and establishing associative links between new and existing notes

[5]

TreeMem (Rezazadeh et al., 2025): Maintains a hierarchical summary tree. New information is not just appended but traverses down to specific leaf nodes based on semantic relevance, forcing a recursive chain of summary updates from the leaf back up to the root to keep the hierarchy consistent
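To make the cost of this leaf-to-root chain concrete, here is a toy sketch of the insertion pattern described above. It is not the TreeMem implementation: `Node`, `overlap`, `summarize`, and `insert` are hypothetical, word overlap stands in for semantic relevance, and string joining stands in for an LLM summarization call.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass(eq=False)
class Node:
    summary: str
    parent: Node | None = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list)


def overlap(a: str, b: str) -> int:
    """Toy relevance score: number of shared lowercase words."""
    return len(set(a.lower().split()) & set(b.lower().split()))


def summarize(children: list[Node]) -> str:
    """Placeholder for an LLM summarization call."""
    return " | ".join(child.summary for child in children)


def insert(root: Node, text: str) -> int:
    """Attach `text` under the most relevant topic; return summaries refreshed."""
    node = root
    while True:
        # Only children that already group other nodes count as topics.
        topics = [child for child in node.children if child.children]
        if not topics:
            break
        node = max(topics, key=lambda child: overlap(child.summary, text))
    node.children.append(Node(summary=text, parent=node))
    updates = 0
    while node is not None:  # walk back up to the root
        node.summary = summarize(node.children)
        updates += 1
        node = node.parent
    return updates


root = Node("root")
travel = Node("travel trips india", parent=root)
work = Node("work project deadline", parent=root)
root.children = [travel, work]
travel.children = [Node("trip to india", parent=travel)]
work.children = [Node("project deadline moved", parent=work)]

updates = insert(root, "booked india flight")
```

In this two-level tree a single new interaction refreshes two summaries, one per ancestor. With an LLM behind `summarize`, every insertion would trigger that many calls.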


[6]

Zep (Rasmussen et al., 2025): Parses interactions into a "Temporal Knowledge Graph." It actively extracts entities and relationships from each turn, modeling them as nodes and edges while explicitly updating the temporal metadata of these connections

[7]

Mem0 (Graph Variant) (Chhikara et al., 2025): Extends atomic fact extraction by organizing data into a graph. It requires per-turn analysis to identify multi-hop relationships between entities, dynamically updating the graph structure as the conversation evolves.

Fact and Summary-based Consolidation

These systems function as active distillers, where the ...

[8]

Mem0 (Chhikara et al., 2025): Runs a dedicated extraction pipeline after every user message. It prompts the LLM to identify atomic facts (e.g., entity-relation triplets), instructing it to add, update, or delete records in the vector database to reflect the latest state
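The add/update/delete step can be pictured as applying a list of LLM-proposed operations to a fact store. The sketch below is illustrative only: it is not Mem0's API, a plain dictionary stands in for the vector database, and the operation format (`action`, `id`, `text`) is an assumption.

```python
def apply_operations(store: dict, operations: list) -> dict:
    """Apply proposed add/update/delete operations to a fact store in place."""
    for op in operations:
        action, fact_id = op["action"], op["id"]
        if action == "add":
            store[fact_id] = op["text"]
        elif action == "update":
            if fact_id not in store:
                raise KeyError(f"cannot update unknown fact: {fact_id}")
            store[fact_id] = op["text"]
        elif action == "delete":
            store.pop(fact_id, None)
        else:
            raise ValueError(f"unknown action: {action}")
    return store


facts = {"f1": "user lives in Boston", "f2": "user owns a cat"}
# Hypothetical output of the extraction prompt for one new user message.
proposed = [
    {"action": "update", "id": "f1", "text": "user lives in Seattle"},
    {"action": "delete", "id": "f2"},
    {"action": "add", "id": "f3", "text": "user started a new job"},
]
apply_operations(facts, proposed)
```

Because the operations are produced by an LLM call after every message, the per-turn cost is dominated by that call and not by the store update itself.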


[9]

MemoryOS (Kang et al., 2025): Features a multi-tiered architecture (Short-, Mid-, and Long-term memories) to manage context flow, emphasizing a dedicated Profile Memory module that explicitly maintains evolving user personas and agent guidelines

[10]

Mirix (Wang and Chen, 2025): Routes every interaction through a parallel extraction pipeline. Raw text is simultaneously processed by distinct modules to distill specific "Knowledge" facts and "Event" summaries, creating a synchronized update across multiple memory stores

[11]

MemGPT (Packer et al., 2024): Treats memory management as an operating system process, employing self-directed function calls to actively summarize and compress ongoing interactions into a fixed-size "Core Memory" block, ensuring key persona and user details are preserved while offloading raw history.

A.2 Retrieval Mechanisms

While memory consolidatio...

[12]

and LongMemEval-S (Wu et al., 2025).

D.1 LoCoMo

LoCoMo (Long-Context Memory) is a benchmark designed to evaluate memory systems in casual, social settings. Unlike standard user-agent interactions, the source texts consist of multi-session human-to-human dialogues between two distinct speakers, simulating the natural evolution of a long-term relationshi...

[13]

Single-hop Retrieval: Questions requiring the retrieval of a specific fact mentioned in a single past session

[14]

Multi-hop Reasoning: Questions that require synthesizing information distributed across multiple distinct sessions to derive an answer

[15]

Temporal Reasoning: Questions testing the system’s ability to understand the sequence of events and relative time expressions

[16]

Open-domain Knowledge: Questions that require combining memory retrieval with external world knowledge

[17]

Adversarial (Excluded): Questions designed to trick the model with false premises. We exclude this category as it lacks reliable ground-truth answers for automated evaluation.

D.2 LongMemEval-S

LongMemEval-S is a subset of the LongMemEval benchmark, curated to evaluate memory systems in agentic, task-oriented interactions with long context windows. Dat...

[18]

Single-session-user: Evaluates the retrieval of specific details explicitly mentioned by the user within the bounds of a single conversation session

[19]

Single-session-assistant: Tests the system’s ability to recall information provided by the assistant itself within a single session, ensuring consistency in the agent’s own history

[20]

Single-session-preference: Assesses whether the model can effectively apply retrieved user information to generate personalized, context-aware responses

[21]

Multi-session: Requires the aggregation of disjoint pieces of information scattered across two or more sessions to derive a complete answer

[22]

Knowledge-update: Probes the system’s capacity to track dynamic changes in the user’s life state and supersede outdated information with new updates

[23]

Temporal-reasoning: Demands chronological deduction by synthesizing both the session metadata (timestamps) and explicit time expressions found in the text.

E Experiment Details

E.1 Baseline Configurations

To ensure fair and reliable comparisons, we configure each baseline to faithfully reflect its original design choices, rather than enforcing a unif...

[24]

Identify topic threads
- First, mentally group the messages into topic threads.
- Messages belong to the same thread if they refer to the same ongoing goal, project, problem, or situation for the conversation participants, even if they are days or weeks apart.
- Ignore or down-weight one-off fictional or hypothetical stories that do not affect the speaker...

[25]

Build temporal structure for each episode
- For each thread that has enough information, order the relevant events chronologically.
- Highlight how the situation develops over time: initial situation, updates, changes of plan, decisions, outcomes, and reflections.
- Emphasize how the speakers' state (plans, preferences, beliefs, emotional reactions) evolv...

[26]

Handle time expressions correctly
- The timestamp of a message is the time when the user said it. It is NOT always the time when the described event happened.
- If a message uses relative time expressions such as "yesterday", "two days ago", "next week", rewrite them in your episode as explicit expressions relative to the timestamp. For example:
  - "the day...

[27]

Focus on episodic narratives, not isolated facts
- Your goal is to construct narrative episodes: what happened, how it evolved, and why it matters to the user.
- Focus on the outer conversation between the two speakers (their goals, decisions, preferences, constraints, and what has been explained to them).
- When a speaker tells a long story, gives an ext...

[28]

Style and output format
- Write each episode as a short, well-formed paragraph (3 to 6 sentences) in clear, neutral language.
- Keep episodes compact. Do not reproduce long fictional plots, full technical explanations, or long lists; refer to them briefly if needed.
- Prefer merging related events into a single episode over splitting them into many small ...

[29]

USER-CENTRIC: Focus on the user's goals, preferences, constraints, decisions, actions, and recurring plans

[30]

TEMPORALLY GROUNDED: For events and changes, include an explicit date anchor when available

[31]

COMPACT BUT USEFUL: Output less than 10 facts. Fewer, higher-value facts are better than many low-value facts.

Each fact MUST belong to one of these types:

[32]

USER_EVENT
- The user asked for something, attended something, started or stopped something, or made a concrete decision.
- Include a date anchor from R if possible, e.g., "On 2023-05-22, the user ..."

[33]

USER_CONSTRAINT_OR_PREFERENCE
- A relatively stable preference, constraint, or recurring plan (e.g., long-term goals, platform choice, budget range, time constraints, content or style preferences)

[34]

TIME_ANCHORED_UPDATE
- A change in behavior, preferences, tools, roles, budgets, relationships, etc.
- State the new value and, if known, the old value or state, with a clear time anchor

[35]

ENTITY_RELATION (USER-RELEVANT)
- A specific relationship between named entities that is relevant to the user (e.g., the user's roles, organizations, projects, courses, tools, locations, or other people they interact with).
- Only keep such a fact if it is specific and likely to matter in future reasoning about the user. Do NOT output generic best-practic...

[36]

Decide whether the new memory piece and the past memory piece describe the same ongoing topic or episode for the conversation participants

[37]

If yes, merge them into ONE coherent episodic memory and return the merged result.

Figure 12: Episodic Merging Role Description

When to merge: Treat the new and past memory pieces as strongly related (should_merge = "yes") if MOST of the following hold:
- They describe the same ongoing situation, goal, project, problem, or life event for the same main per...
