Beyond RAG for Agent Memory: Retrieval by Decoupling and Aggregation

Di Liang; Hanqi Yan; Lin Gui; Qinglin Zhu; Runcong Zhao; Yulan He; Zhanghao Hu

arxiv: 2602.02007 · v4 · submitted 2026-02-02 · 💻 cs.CL · cs.AI

Pith reviewed 2026-05-16 08:21 UTC · model grok-4.3

keywords agent memory · retrieval augmented generation · hierarchical memory · decoupling aggregation · long context QA · memory retrieval · RAG for agents · interaction history

The pith

Agent memory retrieval improves when similar interactions are decoupled into distinct facts before aggregation into groups.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Standard RAG performs poorly on agent memory because interaction streams contain many highly correlated or near-duplicate spans, causing flat top-k retrieval to return redundant context while summary hierarchies lose distinguishing details. The paper argues for a decoupling-before-aggregation principle: isolate reusable facts, updates, and fine-grained differences first, then organize the results for retrieval. It builds xMemory as a revisable hierarchy that turns raw messages into segments, memory components, and aggregated groups using a sparsity-semantic faithfulness objective. At inference the system retrieves top-down, beginning with a compact set of complementary groups and components and expanding only when added evidence reduces uncertainty. Experiments across multiple LLMs on LoCoMo and PerLTQA benchmarks report higher answer quality together with lower inference token counts, backed by measurements of reduced redundancy and improved evidence coverage.

Core claim

Agent memory should follow the principle of decoupling before aggregation: the system first isolates reusable facts, updates, and distinguishing details from similar histories, and only then organises them for efficient retrieval. xMemory realises this by segmenting interaction history into local events, decoupling each segment into memory components, aggregating related components into high-level groups under a sparsity-semantic faithfulness objective, and maintaining the structure incrementally; retrieval proceeds top-down from groups to segments and raw messages only as needed.
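The four levels named above (raw messages, segments, memory components, groups) can be pictured as a small data structure. The following is a minimal sketch, not the authors' implementation: every class, field, and method name (`MemoryHierarchy`, `add_component`, `reassign`, `expand`) is hypothetical, and the LLM-driven segmentation and decoupling steps are left out entirely.

```python
from dataclasses import dataclass, field


@dataclass
class Message:
    msg_id: int
    text: str


@dataclass
class Segment:
    """A local event: a contiguous run of raw messages."""
    seg_id: int
    messages: list[Message]


@dataclass
class Component:
    """One decoupled fact, update, or distinguishing detail."""
    comp_id: int
    text: str
    source_seg_id: int  # back-pointer so retrieval can expand downward


@dataclass
class Group:
    """High-level aggregate of related components."""
    group_id: int
    label: str
    component_ids: list[int] = field(default_factory=list)


@dataclass
class MemoryHierarchy:
    # Hypothetical container; the paper's actual storage layout is not shown here.
    segments: dict[int, Segment] = field(default_factory=dict)
    components: dict[int, Component] = field(default_factory=dict)
    groups: dict[int, Group] = field(default_factory=dict)

    def add_component(self, comp: Component, group_id: int, label: str = "") -> None:
        """Incremental update: attach a new component without rebuilding."""
        self.components[comp.comp_id] = comp
        group = self.groups.setdefault(group_id, Group(group_id, label))
        group.component_ids.append(comp.comp_id)

    def reassign(self, comp_id: int, src: int, dst: int) -> None:
        """Revision: move a component between groups, e.g. after a split or merge."""
        self.groups[src].component_ids.remove(comp_id)
        self.groups.setdefault(dst, Group(dst, "")).component_ids.append(comp_id)

    def expand(self, comp_id: int) -> list[str]:
        """Walk from a component down to the raw messages of its segment."""
        seg = self.segments[self.components[comp_id].source_seg_id]
        return [m.text for m in seg.messages]
```

The back-pointer from component to segment is the design choice that matters in this sketch: it lets retrieval start from compact components and reach raw messages only on demand.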

What carries the argument

xMemory, a revisable hierarchical structure that segments messages into events, decouples them into components, aggregates related components into groups via sparsity-semantic faithfulness, and enables top-down retrieval that expands detail only when uncertainty remains.

If this is right

Redundancy in retrieved context drops because only complementary groups and components are selected first.
Answer quality rises on long-term QA tasks because distinguishing details stay accessible rather than blurred by summarisation.
Inference token usage falls since expansion to raw segments occurs only when additional evidence is required.
Memory can be updated incrementally without rebuilding the entire hierarchy when new interactions arrive.
Evidence density and coverage increase because the structure separates reusable facts from near-duplicates.
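The two-stage retrieval behind these consequences can be sketched in a few lines. This is an illustrative toy under stated assumptions: `overlap` is a word-overlap stand-in for a real embedding similarity, `uncertainty` is a caller-supplied function standing in for whatever reader-uncertainty signal the paper uses, and `retrieve_top_down`, `raw_for`, `n_groups`, and `min_gain` are hypothetical names. Stage 1 here ranks groups by relevance only; it does not model the complementarity criterion the paper describes.

```python
from typing import Callable


def overlap(query: str, text: str) -> float:
    """Toy relevance: Jaccard overlap of lowercase word sets (stand-in for embeddings)."""
    q, t = set(query.lower().split()), set(text.lower().split())
    return len(q & t) / len(q | t) if q | t else 0.0


def retrieve_top_down(
    query: str,
    groups: dict[str, list[str]],   # group label -> component texts
    raw_for: dict[str, str],        # component text -> raw segment text
    uncertainty: Callable[[str, list[str]], float],
    n_groups: int = 1,
    min_gain: float = 0.05,
) -> list[str]:
    # Stage 1: pick a compact backbone from the most relevant groups.
    ranked = sorted(
        groups,
        key=lambda g: max((overlap(query, c) for c in groups[g]), default=0.0),
        reverse=True,
    )
    context = [c for g in ranked[:n_groups] for c in groups[g]]

    # Stage 2: admit raw detail only if it lowers the reader's uncertainty.
    for comp in list(context):
        raw = raw_for.get(comp)
        if raw is None:
            continue
        before = uncertainty(query, context)
        after = uncertainty(query, context + [raw])
        if before - after >= min_gain:
            context.append(raw)
    return context
```

With an uncertainty function that never improves, the result is just the stage-1 backbone, which is where the token savings in this sketch come from.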

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same decoupling step could be inserted into existing vector stores for any domain whose documents contain overlapping passages.
If the sparsity-semantic faithfulness objective generalises, it might replace simple top-k in other hierarchical retrieval pipelines.
Testing the method on tasks that require strict temporal ordering would show whether component-level separation preserves sequence information.
Scaling the hierarchy to multi-session agent histories could reveal whether group-level selection continues to control context length effectively.

Load-bearing premise

Agent memory forms a bounded coherent interaction stream containing many highly correlated or near-duplicate spans that make flat similarity retrieval suboptimal.
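The premise is easy to probe on a toy example. The sketch below uses bag-of-words cosine similarity as a stand-in for embedding similarity (an assumption, not the paper's retriever) and reports the mean pairwise similarity of a flat top-k result as a rough redundancy score. The function names and the metric itself are illustrative choices, not the paper's redundancy analysis.

```python
import math
from collections import Counter
from itertools import combinations


def bow(text: str) -> Counter:
    """Bag-of-words vector over lowercase whitespace tokens."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[w] * b[w] for w in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def top_k(query: str, spans: list[str], k: int) -> list[str]:
    """Flat similarity retrieval: the k spans most similar to the query."""
    q = bow(query)
    return sorted(spans, key=lambda s: cosine(q, bow(s)), reverse=True)[:k]


def redundancy(spans: list[str]) -> float:
    """Mean pairwise cosine similarity; values near 1 mean near-duplicate context."""
    pairs = list(combinations(spans, 2))
    if not pairs:
        return 0.0
    return sum(cosine(bow(a), bow(b)) for a, b in pairs) / len(pairs)


spans = [
    "user likes green tea",
    "user likes green tea daily",
    "user moved to leeds",
]
hits = top_k("what tea does the user like", spans, k=2)
print(hits, round(redundancy(hits), 3))
```

On this made-up memory the two retrieved spans are near duplicates, so the second one adds tokens but almost no new evidence.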

What would settle it

Running the same LoCoMo and PerLTQA experiments on interaction logs engineered to have low correlation between spans and observing no gains in answer quality or token efficiency would falsify the claimed advantage.

Figures

Figures reproduced from arXiv: 2602.02007 by Di Liang, Hanqi Yan, Lin Gui, Qinglin Zhu, Runcong Zhao, Yulan He, Zhanghao Hu.

**Figure 1.** From similarity top-k to structured retrieval for agent memory. Agent memory forms a coherent and highly correlated stream, where many spans are near duplicates; similarity top-k retrieval can therefore collapse and retrieve redundant chunks. xMemory organises memories into a hierarchy of intact units and performs structure-aware retrieval to produce a shorter but more answer-sufficient context.

**Figure 2.** Overview of xMemory. xMemory couples memory structuring with top-down retrieval to address the mismatch between agent memory and the RAG pipeline. It organises a coherent stream into a hierarchy that disentangles episodic traces into semantic components while preserving intact units. A sparsity–semantics objective guides split and merge to keep the high-level organisation searchable and faithful. At retrie…

**Figure 3.** Ablation on LoCoMo with Qwen3-8B. We report BLEU and F1 on the left axis and Token/query on the right axis (lower is better). Memory-only uses our hierarchical memory structure with a simple similarity retriever, but disables both adaptive retrieval stages. +RepSel adds Stage I, which selects representative theme and semantic nodes on the high-level kNN graph. +UncSion adds Stage II, which admits episodi…

**Figure 5.** Structural plasticity vs. downstream QA. We report Avg. BLEU/F1 (left axis) and the dynamic reassignment ratio (right axis) for four construction settings. Freezing high-level restructuring (w/o merge&split) yields 0% reassignment and lower accuracy, while the full system performs substantial reassignment (44.91%) and achieves the best Avg. BLEU/F1.

Original abstract

Standard Retrieval Augmented Generation (RAG) is poorly matched to agent memory. Unlike large heterogeneous corpora, agent memory forms a bounded and coherent interaction stream in which many spans are highly correlated or near duplicates. As a result, flat top-$k$ similarity retrieval often returns redundant context, while summary-centric hierarchies can blur the subtle details that distinguish one candidate from another. We argue that agent memory should follow the principle of decoupling before aggregation: the system should first isolate reusable facts, updates, and distinguishing details from similar histories, and only then organise them for efficient retrieval. Based on this principle, we propose xMemory, which constructs a revisable hierarchical memory structure from original messages to segments, memory components, and groups. xMemory segments interaction history into local events, decouples each segment into memory components, aggregates related components into high-level groups using a sparsity-semantic faithfulness objective, and maintains this structure incrementally as memory evolves. At inference time, xMemory retrieves top-down, first selecting a compact backbone of complementary groups and components, and then expanding to segments and raw messages only when additional evidence reduces the reader's uncertainty. Experiments on LoCoMo and PerLTQA across diverse open source and closed source LLMs show consistent gains in answer quality and inference token efficiency, supported by analyses of redundancy, evidence density, and coverage.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

xMemory's decoupling step before aggregation targets redundancy in agent memory streams better than flat RAG, but the gains rest on unverified LLM construction steps.

This paper's main move is to treat agent memory as a coherent stream full of near-duplicates and argue that retrieval should decouple first: break history into segments, isolate reusable components, then aggregate into groups under a sparsity-semantic objective before doing top-down retrieval at inference. The hierarchy is meant to stay revisable and update incrementally as new messages arrive. That framing is clearer than most RAG extensions for this setting. It directly addresses why summaries blur distinctions and why plain top-k pulls too much overlap in chat-like histories. The incremental maintenance and the staged expansion from groups down to raw messages are practical details that could keep context compact without losing key facts. The abstract reports consistent gains on LoCoMo and PerLTQA for answer quality and token use across open and closed models, plus some supporting checks on redundancy and coverage. Those outcomes line up with the motivation if the numbers hold.

The soft spot is exactly the one the stress-test flags: the decoupling and grouping are done by LLM calls, yet there is no reported error analysis, segmentation accuracy numbers, or ablation that swaps the LLM steps for oracle or rule-based versions. Without that, it is hard to know whether the reported improvements come from the hierarchy itself or from incidental prompt effects and dataset properties. The central claim would be stronger with those checks.

This is aimed at people building memory layers for interactive LLM agents rather than general retrieval researchers. A reader working on production agents could extract the architectural pattern and test it themselves. It deserves a serious referee because the problem is real, the proposed structure is distinct from prior work cited, and the idea is concrete enough to evaluate once the construction reliability is shown. I would send it for review and ask specifically for ablations on the LLM-driven steps and full experimental details.

Referee Report

2 major / 2 minor

Summary. The paper claims that standard RAG is ill-suited to agent memory because interaction streams contain many correlated or near-duplicate spans; flat top-k retrieval yields redundancy while summary hierarchies lose distinguishing details. It proposes xMemory, which first segments history into local events, decouples each into memory components, aggregates related components into high-level groups via a sparsity-semantic faithfulness objective, and maintains the hierarchy incrementally. At inference it performs top-down retrieval (groups and components first, then segments and raw messages only as needed). Experiments on LoCoMo and PerLTQA across open- and closed-source LLMs are said to show consistent gains in answer quality and inference-token efficiency, backed by analyses of redundancy, evidence density, and coverage.

Significance. If the hierarchy construction proves reliable and the reported gains are reproducible with proper controls, the work would offer a concrete alternative to flat retrieval or lossy summarization for bounded, coherent agent memories. The decoupling-before-aggregation principle and the top-down retrieval strategy address a real tension between redundancy and detail preservation; the accompanying analyses of evidence density could become useful diagnostics for other memory systems.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: the claim of 'consistent gains in answer quality and inference token efficiency' is presented without any description of baselines, statistical tests, exact metrics, data splits, or variance across runs. Because the central superiority claim rests entirely on these unshown results, the absence of this information is load-bearing for evaluation.
[Method] Method (hierarchy construction): the revisable structure (segments → components → groups) is built via LLM calls under the sparsity-semantic faithfulness objective, yet no quantitative assessment of segmentation accuracy, component fidelity, or temporal drift is provided, nor are ablations that replace the LLM steps with oracle or rule-based alternatives. Without such checks it remains possible that observed gains arise from incidental prompt effects rather than the decoupling principle itself.

minor comments (2)

[Introduction / Method] The terms 'memory components' and 'high-level groups' are introduced as new entities without explicit comparison to prior memory representations (e.g., episodic vs. semantic buffers) or a clear formal definition of their boundaries.
[Method] The sparsity-semantic faithfulness objective is described at a high level; a short pseudocode or explicit loss formulation would clarify how the two terms are balanced during aggregation.
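To show the kind of formulation the last minor comment asks for, here is one hypothetical way a two-term grouping objective could be written. It is invented for illustration and is not the paper's objective, which this page does not reproduce. The faithfulness term is the mean cosine similarity between each component embedding and its group centroid; the second term is a normalised lookup cost (scan all group heads, then open one group) that penalises both a single oversized group and a swarm of singletons. The weight `lam` and all function names are assumptions.

```python
import math

Vector = list[float]


def cosine(u: Vector, v: Vector) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def centroid(vectors: list[Vector]) -> Vector:
    return [sum(col) / len(vectors) for col in zip(*vectors)]


def faithfulness(groups: list[list[Vector]]) -> float:
    """Mean cosine similarity between each component and its group centroid."""
    sims = [cosine(vec, centroid(g)) for g in groups for vec in g]
    return sum(sims) / len(sims)


def search_cost(groups: list[list[Vector]]) -> float:
    """Normalised expected lookup cost: scan K group heads, then open one group."""
    n = sum(len(g) for g in groups)
    expected_group_size = sum(len(g) ** 2 for g in groups) / n
    return (len(groups) + expected_group_size) / n


def grouping_score(groups: list[list[Vector]], lam: float = 0.5) -> float:
    """Hypothetical objective: higher is better. `lam` balances the two terms."""
    return faithfulness(groups) - lam * search_cost(groups)


# Made-up 2-d embeddings: a and b are similar, c and d are similar.
a, b, c, d = [1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]
two_groups = [[a, b], [c, d]]
one_group = [[a, b, c, d]]
singletons = [[a], [b], [c], [d]]
```

Under this toy scoring, the two-group partition beats both extremes, which is the balancing behaviour a split-and-merge procedure would need from any such objective.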

Simulated Author's Rebuttal

2 responses · 0 unresolved

Thank you for the detailed and constructive review. We appreciate the emphasis on making the experimental claims more transparent and on validating the intermediate steps of hierarchy construction. We address each major comment below and commit to revisions that directly strengthen the manuscript without altering its core claims.

Referee: [Abstract and Experiments] Abstract and Experiments section: the claim of 'consistent gains in answer quality and inference token efficiency' is presented without any description of baselines, statistical tests, exact metrics, data splits, or variance across runs. Because the central superiority claim rests entirely on these unshown results, the absence of this information is load-bearing for evaluation.

Authors: We agree that the abstract would be stronger with a concise description of the experimental protocol. In the revised version we will expand the abstract to name the baselines (flat top-k RAG and summary-centric hierarchies), the primary metrics (answer quality via accuracy and F1, inference token count), the datasets and splits (LoCoMo and PerLTQA), and the reporting convention (means with standard deviation across runs). The Experiments section already contains these details together with the redundancy, evidence-density, and coverage analyses; we will add explicit statistical significance tests (paired t-tests with p-values) and ensure variance is reported in all tables. These changes make the superiority claim self-contained while preserving the existing results.
revision: yes

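For reference, the paired t-test named in this response compares two systems on the same items by testing whether the mean per-item difference is zero. A minimal standard-library sketch follows; the function name `paired_t` is hypothetical and the scores are invented placeholders, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t(a: list[float], b: list[float]) -> tuple[float, int]:
    """Return (t statistic, degrees of freedom) for paired samples a and b."""
    if len(a) != len(b) or len(a) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [x - y for x, y in zip(a, b)]
    sd = stdev(diffs)  # sample standard deviation of the differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t, len(diffs) - 1


# Invented per-run F1 scores for two systems evaluated on the same runs.
system_a = [0.6, 0.7, 0.8, 0.9]
system_b = [0.5, 0.5, 0.7, 0.6]
t_stat, dof = paired_t(system_a, system_b)
print(f"t = {t_stat:.3f}, dof = {dof}")
```

The p-value then comes from the t distribution with `dof` degrees of freedom; if SciPy is available, `scipy.stats.ttest_rel(system_a, system_b)` returns the statistic and the p-value together.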
Referee: [Method] Method (hierarchy construction): the revisable structure (segments → components → groups) is built via LLM calls under the sparsity-semantic faithfulness objective, yet no quantitative assessment of segmentation accuracy, component fidelity, or temporal drift is provided, nor are ablations that replace the LLM steps with oracle or rule-based alternatives. Without such checks it remains possible that observed gains arise from incidental prompt effects rather than the decoupling principle itself.

Authors: This observation is correct and highlights a genuine gap. The current manuscript prioritizes end-to-end retrieval and QA performance over intermediate construction diagnostics. We will add a new analysis subsection that reports: (i) segmentation accuracy on a human-annotated subset of interaction histories, (ii) component fidelity measured by fact overlap with reference decompositions, (iii) temporal drift statistics across incremental updates, and (iv) ablations that replace the LLM-based segmentation and grouping steps with deterministic rule-based alternatives (sentence-boundary splitting and TF-IDF clustering). These additions will allow readers to assess whether the observed gains derive from the decoupling-before-aggregation principle or from prompt-specific effects.
revision: yes
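The rule-based ablation named in item (iv) could look roughly like the sketch below. It is one possible reading, not the authors' planned baseline: sentences are split with a punctuation regex, weighted with a smoothed TF-IDF (the same smoothed idf form that scikit-learn's `TfidfVectorizer` uses by default), and grouped by a greedy single-pass clustering against each cluster's first member. The clustering rule, the `threshold` value, and all function names are assumptions.

```python
import math
import re
from collections import Counter


def split_sentences(text: str) -> list[str]:
    """Rule-based segmentation: split after ., ! or ? followed by whitespace."""
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]


def tfidf(docs: list[str]) -> list[dict[str, float]]:
    """Sparse TF-IDF vectors with smoothed idf = ln((1 + n) / (1 + df)) + 1."""
    tokenised = [re.findall(r"[a-z0-9']+", d.lower()) for d in docs]
    df = Counter(w for toks in tokenised for w in set(toks))
    n = len(docs)
    return [
        {w: tf * (math.log((1 + n) / (1 + df[w])) + 1.0) for w, tf in Counter(toks).items()}
        for toks in tokenised
    ]


def cosine(a: dict[str, float], b: dict[str, float]) -> float:
    dot = sum(v * b.get(w, 0.0) for w, v in a.items())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


def cluster(docs: list[str], threshold: float = 0.1) -> list[list[str]]:
    """Greedy single pass: join the first cluster whose seed is similar enough."""
    vecs = tfidf(docs)
    clusters: list[list[int]] = []
    for i, vec in enumerate(vecs):
        for members in clusters:
            if cosine(vec, vecs[members[0]]) >= threshold:
                members.append(i)
                break
        else:  # no existing cluster matched, so start a new one
            clusters.append([i])
    return [[docs[i] for i in members] for members in clusters]


text = ("I adopted a dog. The dog is called Rex. "
        "My flight to Oslo leaves Monday. The Oslo flight was delayed.")
sentences = split_sentences(text)
for group in cluster(sentences):
    print(group)
```

Because nothing here calls an LLM, swapping this in for the learned segmentation and grouping steps would isolate how much of any gain depends on the hierarchy's shape rather than on prompt behaviour.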

Circularity Check

0 steps flagged

No circularity: empirical method with independent design choices and external benchmarks

The paper introduces xMemory as a hierarchical memory structure (segments to components to groups) built via an LLM-driven process under a sparsity-semantic faithfulness objective. No equations, derivations, or fitted parameters are present that reduce the claimed retrieval gains to the inputs by construction. The objective is presented as an explicit design choice rather than a fitted or self-referential quantity, and performance is evaluated on external datasets (LoCoMo, PerLTQA) across multiple LLMs with analyses of redundancy and coverage. No self-citations are invoked as load-bearing uniqueness theorems or ansatzes. The central claim therefore remains self-contained against external evidence rather than tautological.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 2 invented entities

The central claim rests on the domain assumption that agent memory is a coherent stream with many near-duplicates, plus newly introduced structural entities (components, groups) whose utility is justified only by the proposed method itself.

axioms (1)

domain assumption: Agent memory forms a bounded and coherent interaction stream in which many spans are highly correlated or near duplicates
Explicitly stated in the abstract as the reason standard RAG fails for agents.

invented entities (2)

memory components (no independent evidence)
purpose: Isolated reusable facts, updates, and distinguishing details extracted from segments
New intermediate representation introduced by xMemory with no independent evidence outside the method.

high-level groups (no independent evidence)
purpose: Aggregates of related components for efficient top-down retrieval
New structural layer created by the sparsity-semantic faithfulness objective.

Forward citations

Cited by 2 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

DimMem: Dimensional Structuring for Efficient Long-Term Agent Memory
cs.CL · 2026-05 · unverdicted · novelty 6.0

DimMem introduces a dimensional memory framework that structures memories as typed atomic units to improve retrieval efficiency and accuracy for long-term LLM agent tasks.

Externalization in LLM Agents: A Unified Review of Memory, Skills, Protocols and Harness Engineering
cs.SE · 2026-04 · accept · novelty 5.0

LLM agent progress depends on externalizing cognitive functions into memory, skills, protocols, and harness engineering that coordinates them reliably.
