AdaMem: Learning What to Remember for Personalized Long-Horizon LLM Agents

Liefeng Bo; Rui Wang; Xingyu Chen; Zhaopeng Tu

arxiv: 2606.21144 · v1 · pith:42FJ3V2Tnew · submitted 2026-06-19 · 💻 cs.CL · cs.AI

AdaMem: Learning What to Remember for Personalized Long-Horizon LLM Agents

Xingyu Chen , Rui Wang , Zhaopeng Tu , Liefeng Bo This is my paper

Pith reviewed 2026-06-26 14:05 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords long-term memoryLLM agentsadaptive memorypersonalizationmemory policyquestion answeringfeedback refinement

0 comments

The pith

AdaMem learns what to remember from user feedback to prevent memory bloat in long-horizon LLM agents

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that uniform memory extraction leads to bloat in LLM agents, crowding out useful information. AdaMem instead learns a role-specific Memory Policy refined from weekly QA feedback using self-reflection with rollback. Tested on AdaMem-Bench simulating weekly interactions, it boosts QA accuracy by up to 9% and reduces memory volume by 9% over the baseline. This matters because it allows agents to maintain personalized long-term memory efficiently within context limits.

Core claim

AdaMem maintains a structured, role-specific Memory Policy and refines it from weekly QA feedback through a lightweight, patch-style self-reflection step with failure rollback, leading to improved QA accuracy and reduced memory volume compared to uniform extraction.

What carries the argument

The role-specific Memory Policy, which is adapted via patch-style self-reflection from QA feedback to decide what information to retain.

If this is right

Long-horizon agents can sustain higher QA accuracy over weeks of interaction.
Memory systems can adapt to individual user preferences without storing all facts.
Production LLM agents incur lower inference costs due to smaller memory stores.
Feedback-driven refinement reduces the impact of irrelevant trivia on agent performance.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar feedback mechanisms could improve memory management in multi-user or multi-agent settings.
The method might extend to other types of feedback beyond QA, such as direct user corrections.
Testing on real-world diverse interactions could reveal if the benchmark overestimates gains due to its structured nature.

Load-bearing premise

The assumption that weekly QA feedback is a reliable, unbiased, and representative signal for refining the memory policy without introducing selection effects or overfitting to the benchmark questions.

What would settle it

A controlled experiment where feedback is provided on non-benchmark questions or with noise, showing whether accuracy gains disappear or memory volume fails to shrink.

Figures

Figures reproduced from arXiv: 2606.21144 by Liefeng Bo, Rui Wang, Xingyu Chen, Zhaopeng Tu.

**Figure 1.** Figure 1: Motivating results on AdaMem-Bench (DeepSeek-V4-Flash, Explicit feedback). Accuracy over 10 weeks (smoothed with window w = 3). Ideal Memory and AdaMem maintain high performance, while the uniform Mem0 baseline degrades over time. assumption that more is better. In real-world deployment, this collides with two hard constraints— inference cost and a finite context budget—since every memory that is later re… view at source ↗

**Figure 2.** Figure 2: Analysis of retrieval dynamics and policy convergence over a 10-week horizon. (a) Recall@5 (left) and [PITH_FULL_IMAGE:figures/full_fig_p005_2.png] view at source ↗

read the original abstract

Long-term memory systems for Large Language Model (LLM) agents typically try to \emph{remember everything}, extracting memories uniformly to retain as many facts as possible. In production, however, inference cost and finite context budgets make this untenable: beyond consolidating raw dialogue into memory, an agent must exert \emph{write control}, efficiently keeping only the information each user actually cares about. Otherwise, long-horizon personalized interactions suffer \emph{memory bloat}, where irrelevant trivia crowds out useful information and steadily erodes question-answering (QA) accuracy. We argue that what is worth remembering is role-dependent, and propose \textbf{AdaMem} (Adaptive Memory), a method that \emph{learns what to remember} for each user from feedback. AdaMem maintains a structured, role-specific Memory Policy and refines it from weekly QA feedback through a lightweight, patch-style self-reflection step with failure rollback. To study this setting, we build \textbf{AdaMem-Bench}, a benchmark that simulates weeks of interaction with week-by-week QA. Across two extraction models and two feedback modes, AdaMem improves QA accuracy by up to \textbf{+9.0\%} over the uniform Mem0 baseline while shrinking memory volume by \textbf{9\%}.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

AdaMem adds a feedback-driven adaptive memory policy for LLM agents that cuts bloat while lifting QA scores on their benchmark, but the evaluation leaves the independence of the feedback signal unverified.

read the letter

AdaMem targets memory bloat in long-horizon LLM agents by replacing uniform extraction with a role-specific Memory Policy that gets refined from weekly QA feedback. The update uses a lightweight self-reflection step plus rollback on failures. On AdaMem-Bench, which runs simulated weeks of interaction, the method beats the Mem0 baseline by up to 9% QA accuracy while trimming memory volume by 9% across two extractors and two feedback modes.

The concrete advance is the shift to learned, user-role-dependent write control instead of trying to retain everything. The self-reflection loop is a straightforward way to incorporate external signals without heavy retraining, and the benchmark itself is a reasonable attempt to make long-term personalization testable.

The main weakness is that the abstract supplies no information on statistical significance, run-to-run variance, or how feedback questions are kept separate from evaluation items. If the weekly QA used for updates overlaps with later test questions or if updates are only triggered on certain failure types, the reported gains could reflect benchmark construction rather than a generalizable policy. That concern from the stress-test note stands until the full paper shows explicit partitioning or controls.

The work is aimed at researchers and engineers building personalized agents who already deal with context limits and memory management. It is incremental but directly useful in that niche. The paper shows honest engagement with the production constraint even if the current evidence is thin on verification details.

I would bring it to a reading group to discuss the feedback mechanism. It deserves peer review so the experimental controls can be examined.

Referee Report

2 major / 2 minor

Summary. The paper claims that long-term memory for LLM agents suffers from bloat under uniform extraction and proposes AdaMem to learn a role-specific Memory Policy from weekly QA feedback via lightweight self-reflection with failure rollback. On the introduced AdaMem-Bench simulating week-by-week interactions, AdaMem yields up to +9.0% QA accuracy gains and 9% memory volume reduction versus the Mem0 uniform baseline across two extraction models and two feedback modes.

Significance. If the reported gains prove robust and the learned policy generalizes, the work would be significant for practical deployment of personalized long-horizon agents by replacing uniform memory consolidation with learned write control. The benchmark construction and patch-style update mechanism are concrete contributions that could be adopted or extended.

major comments (2)

[Abstract] Abstract: the central quantitative claims (+9.0% QA accuracy, 9% volume reduction) are presented without any mention of statistical significance, variance across runs, number of seeds, or exact benchmark construction details; this information is load-bearing for assessing whether the gains support the claim of a generalizable Memory Policy.
[§4] AdaMem-Bench and experimental setup (presumably §4): no description is given of how weekly feedback questions are sampled or partitioned relative to the held-out evaluation questions, nor of any controls for topic selection effects or overlap; without this, the self-reflection step could be fitting to benchmark-specific failures rather than learning role-dependent importance, directly undermining the generalizability claim.

minor comments (2)

The abstract introduces 'Memory Policy' and 'failure rollback' without a brief formal definition or pseudocode sketch, which would aid immediate understanding of the method.
[Abstract] The two feedback modes and two extraction models are referenced but not named in the abstract; listing them would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address each major comment below and commit to revisions that strengthen the reporting of statistical details and benchmark construction without misrepresenting our results.

read point-by-point responses

Referee: [Abstract] Abstract: the central quantitative claims (+9.0% QA accuracy, 9% volume reduction) are presented without any mention of statistical significance, variance across runs, number of seeds, or exact benchmark construction details; this information is load-bearing for assessing whether the gains support the claim of a generalizable Memory Policy.

Authors: We agree that the abstract would benefit from additional context on robustness. In the revision we will note that results are averaged over multiple seeds with standard deviations reported in the main text, and we will expand benchmark construction details in Section 4. This directly addresses the load-bearing information for the generalizability claim. revision: yes
Referee: [§4] AdaMem-Bench and experimental setup (presumably §4): no description is given of how weekly feedback questions are sampled or partitioned relative to the held-out evaluation questions, nor of any controls for topic selection effects or overlap; without this, the self-reflection step could be fitting to benchmark-specific failures rather than learning role-dependent importance, directly undermining the generalizability claim.

Authors: We acknowledge the description of sampling and partitioning is currently insufficient. Weekly feedback questions are drawn from that week's interactions while evaluation questions are strictly held out from subsequent weeks; we will add an explicit subsection in the revised §4 detailing the sampling procedure, train/eval partition, and topic-overlap controls to show the policy learns role-dependent importance rather than benchmark artifacts. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method with external feedback, no derivations or self-referential predictions.

full rationale

The paper presents AdaMem as an empirical system that refines a memory policy from weekly QA feedback on AdaMem-Bench. No equations, first-principles derivations, fitted parameters renamed as predictions, or self-citation chains appear in the described method or results. The +9.0% accuracy and 9% volume claims are direct empirical measurements against a uniform baseline, not reductions to inputs by construction. The method is self-contained against external benchmarks and feedback signals.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 1 invented entities

The central claim rests on the premise that feedback-driven self-reflection can reliably identify user-relevant facts and that the new benchmark faithfully simulates production long-horizon use; no free parameters or invented physical entities are described.

axioms (1)

domain assumption Weekly QA feedback supplies an unbiased and sufficient training signal for memory policy updates
Invoked to justify the self-reflection and rollback mechanism.

invented entities (1)

Memory Policy no independent evidence
purpose: Structured, role-specific write control for selective memory retention
Core component of AdaMem introduced to replace uniform extraction.

pith-pipeline@v0.9.1-grok · 5764 in / 1171 out tokens · 22345 ms · 2026-06-26T14:05:28.598714+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references

[1]

2023 , journal =

MemGPT: Towards LLMs as Operating Systems , author =. 2023 , journal =

2023
[2]

Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=
[3]

2023 , journal =

Reflexion: language agents with verbal reinforcement learning , author =. 2023 , journal =

2023
[4]

Retrieval-augmented generation for knowledge-intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and Kuttler, Heinrich and Lewis, Mike and Yih, Wen-tau and Riedel, Sebastian , journal=. Retrieval-augmented generation for knowledge-intensive
[5]

2023 , journal =

Judging llm-as-a-judge with mt-bench and chatbot arena , author =. 2023 , journal =

2023
[6]

Mem0: The Memory Layer for Personalized

Mem0 , year=. Mem0: The Memory Layer for Personalized
[7]

2024 , journal =

MemoryBank: Enhancing Large Language Models with Long-Term Memory , author =. 2024 , journal =

2024
[8]

2024 , journal =

A survey on large language model based autonomous agents , author =. 2024 , journal =

2024
[9]

2023 , journal =

Self-Refine: Iterative Refinement with Self-Feedback , author =. 2023 , journal =

2023
[10]

2025 , journal =

Rap: Retrieval-augmented personalization for multimodal large language models , author =. 2025 , journal =

2025
[11]

Evaluating Very Long-Term Conversational Memory of

Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei , journal=. Evaluating Very Long-Term Conversational Memory of
[12]

2024 , journal =

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory , author =. 2024 , journal =

2024
[13]

2026 , journal =

EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models , author =. 2026 , journal =

2026
[14]

2025 , journal =

Memory OS of AI Agent , author =. 2025 , journal =

2025
[15]

2026 , journal =

All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution , author =. 2026 , journal =

2026
[16]

2026 , journal =

Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents , author =. 2026 , journal =

2026
[17]

2025 , journal =

MemInsight: Autonomous Memory Augmentation for LLM Agents , author =. 2025 , journal =

2025
[18]

2026 , journal =

SimpleMem: Efficient Lifelong Memory for LLM Agents , author =. 2026 , journal =

2026
[19]

2025 , journal =

Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science , author =. 2025 , journal =

2025
[20]

2025 , journal =

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents , author =. 2025 , journal =

2025
[21]

2025 , journal =

Dynamic Affective Memory Management for Personalized LLM Agents , author =. 2025 , journal =

2025
[22]

2026 , journal =

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents , author =. 2026 , journal =

2026
[23]

2025 , journal =

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents , author =. 2025 , journal =

2025
[24]

2026 , journal =

EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents , author =. 2026 , journal =

2026
[25]

2026 , journal =

Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive Retrieval , author =. 2026 , journal =

2026
[26]

2026 , journal =

A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty , author =. 2026 , journal =

2026

[1] [1]

2023 , journal =

MemGPT: Towards LLMs as Operating Systems , author =. 2023 , journal =

2023

[2] [2]

Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

Generative agents: Interactive simulacra of human behavior , author=. Proceedings of the 36th Annual ACM Symposium on User Interface Software and Technology , pages=

[3] [3]

2023 , journal =

Reflexion: language agents with verbal reinforcement learning , author =. 2023 , journal =

2023

[4] [4]

Retrieval-augmented generation for knowledge-intensive

Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and Petroni, Fabio and Karpukhin, Vladimir and Goyal, Naman and Kuttler, Heinrich and Lewis, Mike and Yih, Wen-tau and Riedel, Sebastian , journal=. Retrieval-augmented generation for knowledge-intensive

[5] [5]

2023 , journal =

Judging llm-as-a-judge with mt-bench and chatbot arena , author =. 2023 , journal =

2023

[6] [6]

Mem0: The Memory Layer for Personalized

Mem0 , year=. Mem0: The Memory Layer for Personalized

[7] [7]

2024 , journal =

MemoryBank: Enhancing Large Language Models with Long-Term Memory , author =. 2024 , journal =

2024

[8] [8]

2024 , journal =

A survey on large language model based autonomous agents , author =. 2024 , journal =

2024

[9] [9]

2023 , journal =

Self-Refine: Iterative Refinement with Self-Feedback , author =. 2023 , journal =

2023

[10] [10]

2025 , journal =

Rap: Retrieval-augmented personalization for multimodal large language models , author =. 2025 , journal =

2025

[11] [11]

Evaluating Very Long-Term Conversational Memory of

Maharana, Adyasha and Lee, Dong-Ho and Tulyakov, Sergey and Bansal, Mohit and Barbieri, Francesco and Fang, Yuwei , journal=. Evaluating Very Long-Term Conversational Memory of

[12] [12]

2024 , journal =

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory , author =. 2024 , journal =

2024

[13] [13]

2026 , journal =

EverMemBench: Benchmarking Long-Term Interactive Memory in Large Language Models , author =. 2026 , journal =

2026

[14] [14]

2025 , journal =

Memory OS of AI Agent , author =. 2025 , journal =

2025

[15] [15]

2026 , journal =

All-Mem: Agentic Lifelong Memory via Dynamic Topology Evolution , author =. 2026 , journal =

2026

[16] [16]

2026 , journal =

Beyond Dialogue Time: Temporal Semantic Memory for Personalized LLM Agents , author =. 2026 , journal =

2026

[17] [17]

2025 , journal =

MemInsight: Autonomous Memory Augmentation for LLM Agents , author =. 2025 , journal =

2025

[18] [18]

2026 , journal =

SimpleMem: Efficient Lifelong Memory for LLM Agents , author =. 2026 , journal =

2026

[19] [19]

2025 , journal =

Nemori: Self-Organizing Agent Memory Inspired by Cognitive Science , author =. 2025 , journal =

2025

[20] [20]

2025 , journal =

PersonaAgent: Bridging Memory and Action for Personalized LLM Agents , author =. 2025 , journal =

2025

[21] [21]

2025 , journal =

Dynamic Affective Memory Management for Personalized LLM Agents , author =. 2025 , journal =

2025

[22] [22]

2026 , journal =

Agentic Memory: Learning Unified Long-Term and Short-Term Memory Management for Large Language Model Agents , author =. 2026 , journal =

2026

[23] [23]

2025 , journal =

In Prospect and Retrospect: Reflective Memory Management for Long-term Personalized Dialogue Agents , author =. 2025 , journal =

2025

[24] [24]

2026 , journal =

EvolveMem:Self-Evolving Memory Architecture via AutoResearch for LLM Agents , author =. 2026 , journal =

2026

[25] [25]

2026 , journal =

Evoking User Memory: Personalizing LLM via Recollection-Familiarity Adaptive Retrieval , author =. 2026 , journal =

2026

[26] [26]

2026 , journal =

A Survey on the Security of Long-Term Memory in LLM Agents: Toward Mnemonic Sovereignty , author =. 2026 , journal =

2026