arxiv: 2605.21768 · v1 · pith:62FV7ZIAnew · submitted 2026-05-20 · 💻 cs.LG · cs.MA

Memory-R2: Fair Credit Assignment for Long-Horizon Memory-Augmented LLM Agents

Pith reviewed 2026-05-22 09:14 UTC · model grok-4.3

classification 💻 cs.LG cs.MA
keywords memory-augmented LLM agents · credit assignment · reinforcement learning · group-relative optimization · long-horizon training · memory formation · multi-session environments

The pith

LoGo-GRPO enables fair credit assignment for memory operations in long-horizon LLM agents by comparing outcomes from the same intermediate memory state.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Memory-augmented LLM agents store and reuse information across sessions, but training them with reinforcement learning runs into a core issue: different rollouts write or change memories differently, so they no longer operate in the same environment. This breaks the fairness assumption behind group-relative methods like GRPO and turns trajectory rewards into noisy signals when credit has to be assigned to individual memory steps. Memory-R2 introduces LoGo-GRPO, which runs local rerollouts from an identical memory state so that memory operations can be compared directly, while still using global trajectory rewards to preserve end-to-end learning. The framework adds shared-parameter co-learning for fact extraction and memory management, plus a curriculum that grows the training horizon from 8 to 16 to 32 sessions. A sympathetic reader would see this as a route to stable training of agents that remember and act over many interactions.

Core claim

The paper claims that LoGo-GRPO yields fairer group comparisons and more precise supervision for memory construction by comparing different memory-operation outcomes from the same intermediate memory state while preserving end-to-end learning from long-horizon trajectory-level rewards. This is realized through local rerollouts combined with global group-relative optimization, a shared-parameter design that instantiates a fact extractor and memory manager from one LLM backbone, and a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions.
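The progressive curriculum can be pictured as a simple epoch-to-horizon schedule. The sketch below is only an illustration: the horizons 8, 16, and 32 come from the paper, while the function name `horizon_for_epoch` and the number of epochs per stage are assumptions, since the source does not state the schedule.

```python
from bisect import bisect_right
from itertools import accumulate

# Session horizons named in the paper's curriculum: 8 -> 16 -> 32.
HORIZONS = (8, 16, 32)


def horizon_for_epoch(epoch, epochs_per_stage=(2, 2, 2)):
    """Return how many sessions a training trajectory spans at a given epoch.

    epochs_per_stage is a hypothetical schedule (not given in the source).
    Epochs past the last boundary stay at the final horizon.
    """
    if epoch < 0:
        raise ValueError("epoch must be non-negative")
    boundaries = list(accumulate(epochs_per_stage))  # e.g. [2, 4, 6]
    stage = min(bisect_right(boundaries, epoch), len(HORIZONS) - 1)
    return HORIZONS[stage]


schedule = [horizon_for_epoch(e) for e in range(6)]
```

With the assumed two epochs per stage, the first two epochs train on 8-session trajectories, the next two on 16, and everything after that on 32.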

What carries the argument

LoGo-GRPO, which runs local rerollouts from an identical intermediate memory state to compare memory-operation outcomes fairly while retaining global optimization on full trajectory rewards.
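A minimal sketch of the local-rerollout idea, under stated assumptions: every candidate memory operation is applied to its own copy of the same memory snapshot, each branch is scored, and rewards are standardized within that group. The dictionary memory store, the `score` callback, and all function names are hypothetical stand-ins; the paper's actual rollout and reward machinery is not reproduced here.

```python
import copy
import statistics
from typing import Callable, Dict, List

Memory = Dict[str, str]  # memory_id -> content (simplified, hypothetical store)


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Standardize rewards within one group, GRPO-style."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


def local_rerollout_advantages(
    snapshot: Memory,
    candidate_ops: List[Callable[[Memory], object]],
    score: Callable[[Memory], float],
) -> List[float]:
    """Branch every candidate operation from the SAME snapshot, then compare."""
    rewards = []
    for op in candidate_ops:
        branch = copy.deepcopy(snapshot)  # identical starting memory state
        op(branch)                        # apply one candidate memory operation
        rewards.append(score(branch))     # stand-in for a downstream reward
    return group_relative_advantages(rewards)


# Toy example: the reward counts how many needed facts the memory still holds.
needed = {"Alice lives in Paris", "Alice adopted a dog"}
snapshot = {"m1": "Alice lives in Paris"}
ops = [
    lambda m: m.update({"m2": "Alice adopted a dog"}),  # write a new fact
    lambda m: None,                                      # no operation
    lambda m: m.pop("m1"),                               # delete a needed fact
]
advantages = local_rerollout_advantages(
    snapshot, ops, score=lambda m: float(len(needed & set(m.values())))
)
```

In this toy case the write receives a positive advantage, the no-op lands at zero, and the deletion is penalized. Because every branch starts from the same snapshot, the reward differences within the group are not contaminated by earlier memory divergence.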

If this is right

  • Individual memory operations such as write, update, or delete receive more precise credit signals than trajectory-level rewards alone can supply.
  • Memory formation and memory evolution can be jointly optimized through shared parameters.
  • Training stays stable while the number of sessions grows progressively from 8 to 32.
  • Multi-session environments become usable for reinforcement learning without systematic bias in group comparisons.
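To make the memory operations concrete, here is a small sketch of applying a batch of INSERT/UPDATE/DELETE edits to a memory store. The operation names and the fields `memory_id`, `speaker`, and `content` follow the paper's memory-manager prompt as far as it is visible on this page; the `op` key, the generated ID format, and the function name `apply_edits` are assumptions made for illustration.

```python
import json
from itertools import count
from typing import Dict


def apply_edits(store: Dict[str, dict], edits_json: str, id_counter=None) -> Dict[str, dict]:
    """Apply a JSON list of memory edits to a copy of the store and return it."""
    if id_counter is None:
        id_counter = count(1)
    new_store = {mid: dict(entry) for mid, entry in store.items()}  # leave input intact
    for edit in json.loads(edits_json):
        op = edit["op"]  # hypothetical key name for the operation type
        if op == "INSERT":
            # The system, not the model, assigns IDs for new entries.
            memory_id = f"mem{next(id_counter):04d}"
            while memory_id in new_store:
                memory_id = f"mem{next(id_counter):04d}"
            new_store[memory_id] = {"speaker": edit["speaker"], "content": edit["content"]}
        elif op in ("UPDATE", "DELETE"):
            memory_id = edit["memory_id"]
            if memory_id not in new_store:
                raise KeyError(f"unknown memory_id: {memory_id}")
            if op == "UPDATE":
                new_store[memory_id]["content"] = edit["content"]
            else:
                del new_store[memory_id]
        else:
            raise ValueError(f"unsupported operation: {op}")
    return new_store


store = {"a1": {"speaker": "John", "content": "John likes tea."}}
edits = json.dumps([
    {"op": "UPDATE", "memory_id": "a1", "content": "John likes green tea."},
    {"op": "INSERT", "speaker": "John", "content": "John plans a trip to Rome."},
])
updated = apply_edits(store, edits)
```

Returning a fresh store rather than mutating the input keeps the pre-edit state available, which is what branching several candidate edits from one shared state requires.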

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The local-rerollout technique could apply to any agent system whose actions persistently change future observations, such as database agents or long-running planners.
  • Progressive lengthening of horizons may prove necessary whenever reinforcement learning must handle accumulating state changes.
  • The same design might raise performance in sequential tasks that rely on retained facts, including multi-turn dialogue or cumulative reasoning chains.

Load-bearing premise

Local rerollouts starting from an identical intermediate memory state produce sufficiently representative and unbiased comparisons for credit assignment to memory operations.

What would settle it

An experiment in which agents trained with LoGo-GRPO show no reduction in credit signal noise or no performance gain over standard GRPO on tasks that require consistent memory across many sessions.

Figures

Figures reproduced from arXiv: 2605.21768 by Ahmed Bahloul, Ercong Nie, Riccardo Trivisonno, Sikuan Yan, Susanna Schwarzmann, Volker Tresp, Yunpu Ma.

Figure 1: Overview of Memory-R2. (a) Memory-R2 uses a shared-backbone extractor–manager …
Figure 2: Generalization of Memory-R2 across (a) OOD benchmarks, (b) backbone sizes, and (c) …
Figure 3: LoGo-GRPO and curriculum learning are both essential. (a,b) LoGo-GRPO consistently outperforms GRPO across curriculum stages. (c,d) Curriculum training remains stable under equal compute, whereas direct 32-session training collapses validation F1 from 0.47 to 0.27 and increases M-Fail to 72.1%.
Figure 4: Inference efficiency and compression penalty analysis. (a,b) Accuracy–latency trade-off measured by F1 vs. time per conversation and per generated token. (c,d) Effect of λcomp ∈ {0, 0.1, 0.3, 0.5} on F1 and BLEU-1; the yellow band marks λcomp = 0.3, and rings mark the best value.
Figure 5: LoGo-GRPO training pipeline for memory manager. Memory bank construction via …
Figure 6: Prompt template for atomic fact extraction. Each extracted fact is a self-contained, third-person statement tagged with the originating dia_id, and is then passed to the memory manager …
Figure 7: Prompt template for the memory manager. The model receives the current memory store and a batch of atomic facts (output of the fact-retrieval stage, Appendix B.1) and emits a JSON list of INSERT/UPDATE/DELETE edits. A fixed decision order, an atomicity constraint, and explicit non-destructive update semantics together prevent the common failure modes of LLM-based memory writers, namely fact loss, duplicate…
Figure 8: Prompt template used for memory-based question answering. Double-braced tokens denote runtime placeholders. Model outputs are parsed from the <answer>...</answer> span and scored with SQuAD-style token F1. Appendix C.1 of the paper additionally reports an LLM-as-a-Judge (J) score, alongside F1 and B1, that captures semantic equivalence between the generated answer and the gold answer.
Figure 9: Prompt template: LLM-as-a-Judge. Your task is to label an answer to a question as 'CORRECT' or 'WRONG'. You will be given: (1) a question, (2) a gold (ground truth) answer, (3) a generated answer. The gold answer is usually concise; the generated answer may be longer. Be generous: if the generated answer touches on the same topic/date as the gold, count CORRECT. Different formats for …
Figure 10: LoGo-GRPO consistently outperforms GRPO across all question types and curriculum stages. Judge accuracy (J) at curriculum stages 8 → 16 → 32 sessions, broken down by question type: (a) Single-hop, (b) Multi-hop, (c) Temporal, (d) Open-domain. LoGo-GRPO (blue) dominates GRPO (gray) at every stage and on every category, with the shaded band visualizing the gap, indicating that local rerollouts constantly miti…
Figure 11: Curriculum learning is essential for stable long-horizon training. Training dynamics of curriculum 8 → 16 → 32 sessions (blue) vs. direct 32-session training (orange) under equal compute. The x-axis is the cumulative epochs within the curriculum; direct-32sess is linearly stretched onto the same axis for fair comparison. (a) Validation F1 on LoCoMo: the curriculum stabilizes around 0.50, while direct-32se…
Figure 12: Latency-mechanism diagnostics. (a) Memory-R2 improves F1 while reducing per …
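The Figure 8 caption says answers are parsed from the `<answer>...</answer>` span and scored with SQuAD-style token F1. A minimal sketch of that scoring follows, assuming the usual SQuAD normalization (lowercasing, stripping punctuation and English articles). The function names are illustrative, and the paper's exact normalization may differ.

```python
import re
import string
from collections import Counter

_PUNCT = set(string.punctuation)


def extract_answer(model_output: str) -> str:
    """Return the text inside the first <answer>...</answer> span, or ''."""
    match = re.search(r"<answer>(.*?)</answer>", model_output, flags=re.DOTALL)
    return match.group(1).strip() if match else ""


def normalize(text: str) -> list:
    """Lowercase, drop punctuation and articles, then split on whitespace."""
    text = "".join(ch for ch in text.lower() if ch not in _PUNCT)
    text = re.sub(r"\b(a|an|the)\b", " ", text)
    return text.split()


def token_f1(prediction: str, gold: str) -> float:
    """Harmonic mean of token precision and recall on normalized tokens."""
    pred_tokens, gold_tokens = normalize(prediction), normalize(gold)
    if not pred_tokens or not gold_tokens:
        # Both empty counts as a match; exactly one empty counts as a miss.
        return float(pred_tokens == gold_tokens)
    overlap = sum((Counter(pred_tokens) & Counter(gold_tokens)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred_tokens)
    recall = overlap / len(gold_tokens)
    return 2 * precision * recall / (precision + recall)


f1_reordered = token_f1(extract_answer("Reasoning... <answer>7 May 2023</answer>"), "May 7, 2023")
f1_verbose = token_f1("the trip was in 2021", "2021")
```

Token F1 ignores word order, so "7 May 2023" and "May 7, 2023" score as a full match, while the longer answer "the trip was in 2021" is penalized on precision against the gold answer "2021".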
Original abstract

Memory-augmented LLM agents enable interactions that extend beyond finite context windows by storing, updating, and reusing information across sessions. However, training such agents with reinforcement learning in multi-session environments is challenging because memory turns the agent's past actions into part of its future environment. Once different rollouts write, update, or delete different memories, they no longer share the same intermediate memory state, making trajectory-level comparisons fundamentally unfair. This violates a key assumption behind group-relative methods such as GRPO, where rollouts are compared as if they were sampled from the same effective environment. Consequently, trajectory-level rewards provide noisy or biased credit signals for long-horizon memory operations. To address this challenge, we introduce Memory-R2, a training framework for long-horizon memory-augmented LLM agents. Its core algorithm, LoGo-GRPO, combines local and global group-relative optimization. The global objective preserves end-to-end learning from long-horizon trajectory-level rewards, while local rerollouts compare different memory-operation outcomes from the same intermediate memory state, yielding fairer group comparisons and more precise supervision for memory construction. Beyond credit assignment, Memory-R2 jointly optimizes memory formation and memory evolution with a shared-parameter co-learning design, where a fact extractor and a memory manager are instantiated from the same LLM backbone through role-specific prompts. To stabilize multi-step RL over long memory horizons, we adopt a progressive curriculum that increases the training horizon from 8 to 16 to 32 sessions. Together, these components provide an effective training paradigm for memory-augmented LLM agents in long-horizon multi-session settings.
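For orientation, the group-relative comparison the abstract refers to can be written down explicitly. The first expression is the standard GRPO advantage for a group of G rollouts with scalar rewards r_1, …, r_G. The second is only one plausible way to combine a global and a local term: the additive form and the weight β are assumptions made here for illustration, not the paper's stated objective.

```latex
% Standard GRPO: reward of rollout i standardized within its group of G rollouts.
\hat{A}_i = \frac{r_i - \operatorname{mean}\big(\{r_j\}_{j=1}^{G}\big)}
                 {\operatorname{std}\big(\{r_j\}_{j=1}^{G}\big)}

% Hypothetical combination (assumed form; \beta is an illustrative weight):
% a trajectory-level term plus a term computed over local rerollouts that
% all branch from the same intermediate memory state at step t.
\hat{A}^{\mathrm{LoGo}}_{i,t} = \hat{A}^{\mathrm{global}}_{i} + \beta\, \hat{A}^{\mathrm{local}}_{i,t}
```

The standardization in the first line is only meaningful if all G rewards come from comparable conditions, which is the assumption that divergent memory states break.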

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript introduces Memory-R2, a training framework for long-horizon memory-augmented LLM agents. It diagnoses that memory updates across rollouts destroy the shared-environment assumption required by group-relative policy optimization methods such as GRPO, producing biased trajectory-level credit signals. The core algorithm LoGo-GRPO combines a global term that retains end-to-end learning from full-horizon rewards with local rerollouts that branch different memory operations from an identical intermediate memory state. The framework further employs shared-parameter co-learning (fact extractor and memory manager instantiated from the same LLM via role-specific prompts) and a progressive curriculum that scales the training horizon from 8 to 16 to 32 sessions.

Significance. If the empirical claims are substantiated, the local-global decomposition in LoGo-GRPO could supply a practical route to fairer credit assignment for persistent memory operations without sacrificing long-horizon optimization. The shared-parameter co-learning design and curriculum are sensible engineering choices that directly target joint optimization and training stability; together they address a concrete obstacle in scaling RL to memory-augmented agents.

major comments (1)
  1. [Abstract (LoGo-GRPO paragraph)] The claim that local rerollouts from a shared intermediate memory state deliver fairer comparisons and more precise supervision for memory construction presupposes that outcome differences can be attributed primarily to the memory operation itself. Because the LLM policy is conditioned on memory contents, any alteration immediately shifts the action distribution, subsequent observations, and future memory updates. The manuscript does not describe importance weighting, downstream-policy freezing, or averaging over policy stochasticity to isolate the memory-operation effect; without such controls the local groups risk confounding memory credit with policy response, an issue that grows with horizon length.
minor comments (2)
  1. [Abstract] The abstract would be strengthened by a concise statement of the experimental domains, baselines, and key quantitative results so that readers can immediately gauge empirical support.
  2. [Method] Explicit pseudocode or equations defining the local and global objectives of LoGo-GRPO (including how the two terms are combined and how local groups are sampled) would improve clarity and reproducibility.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their careful reading and constructive feedback on our manuscript. The major comment raises a valid point regarding potential confounding in the local rerollouts, which we address below with a revision to improve clarity.

Point-by-point responses
  1. Referee: [Abstract (LoGo-GRPO paragraph)] The claim that local rerollouts from a shared intermediate memory state deliver fairer comparisons and more precise supervision for memory construction presupposes that outcome differences can be attributed primarily to the memory operation itself. Because the LLM policy is conditioned on memory contents, any alteration immediately shifts the action distribution, subsequent observations, and future memory updates. The manuscript does not describe importance weighting, downstream-policy freezing, or averaging over policy stochasticity to isolate the memory-operation effect; without such controls the local groups risk confounding memory credit with policy response, an issue that grows with horizon length.

    Authors: We agree that local rerollouts from a shared memory state do not fully isolate the memory operation from downstream policy effects, since the updated memory immediately conditions the LLM policy and influences subsequent actions and observations. This confounding is an inherent feature of memory-augmented agents rather than an artifact of our method. The primary goal of the local groups in LoGo-GRPO is to eliminate the more severe bias that arises when trajectories are compared across entirely divergent memory histories (as occurs in standard GRPO), by ensuring identical starting memory states for the memory-operation branches. We do not employ importance weighting, downstream-policy freezing, or explicit averaging over stochasticity in the current design, as these would increase computational cost in long-horizon settings. We have revised Section 3.2 to explicitly discuss this assumption, the remaining confounding risk, and why the local-global combination still yields fairer credit assignment than baselines. Empirical results in the paper support the practical benefit of this approach.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity detected; algorithmic proposal is self-contained

Full rationale

The paper identifies a concrete problem with standard group-relative methods like GRPO when applied to memory-augmented agents: divergent memory states across rollouts violate the shared-environment assumption required for fair trajectory comparisons. It then defines LoGo-GRPO as an explicit combination of local rerollouts (branching memory operations from a fixed intermediate state) plus a retained global objective. This construction is presented as a direct response to the stated problem rather than a derivation that reduces to fitted parameters, self-citations, or prior results by the authors. The shared-parameter co-learning design and progressive curriculum are likewise introduced as engineering choices for stability, not as quantities whose justification collapses into the inputs. No equations or load-bearing claims in the provided text reduce the central result to its own definitions or to self-referential citations.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the assumption that memory states can be held fixed for local rerollouts and that progressive horizon lengthening stabilizes training; no explicit free parameters or invented physical entities are stated.

axioms (1)
  • Domain assumption: Different rollouts that modify memory no longer share the same intermediate state, violating the equal-environment assumption of group-relative methods.
    Stated directly in the abstract as the core challenge.

pith-pipeline@v0.9.0 · 5845 in / 1148 out tokens · 20771 ms · 2026-05-22T09:14:42.650286+00:00 · methodology


