GAAMA: Graph Augmented Associative Memory for Agents

Nitin Sareen; Shubhendu Sharma; Swarna Kamal Paul

arxiv: 2603.27910 · v2 · pith:KFUVCQD4new · submitted 2026-03-29 · 💻 cs.AI · cs.IR · cs.MA

Pith reviewed 2026-05-14 21:05 UTC · model grok-4.3

classification 💻 cs.AI · cs.IR · cs.MA

keywords associative memory · knowledge graph · AI agents · long-term memory · retrieval augmented generation · multi-session conversations · graph retrieval · memory repair

The pith

GAAMA builds a concept-mediated knowledge graph to improve memory retrieval for AI agents across multi-session conversations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that current memory systems for long-running AI agents either flatten all past interactions into a vector store, losing connections, or rely on entity-centric graphs that create overly central hubs from repeated names. GAAMA instead extracts atomic facts and topic concepts from preserved episodes, adds synthesized reflections, and links them with five edge types so that concept nodes provide bridging paths for relevance. Retrieval then scores candidates by combining vector similarity with an edge-aware graph walk, followed by a targeted repair step that fixes detected failures. If the structure holds, agents should sustain coherent personalized responses even as dialogue histories lengthen, without the performance drops seen in baselines.

Core claim

GAAMA constructs a concept-mediated knowledge graph through verbatim episode storage, LLM extraction of atomic facts and topic-level concepts, and synthesis of higher-order reflections, using four node types connected by five structural edge types so that concept nodes supply cross-cutting traversal paths, with retrieval performed by an additive function of cosine k-nearest-neighbor scores and edge-type-aware Personalized PageRank, plus a post-retrieval GRAFT repair layer.
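A minimal sketch of the additive retrieval score described above, under stated assumptions. The source says only that cosine k-nearest-neighbor scores and edge-type-aware Personalized PageRank are combined additively. The weights `alpha` and `beta`, the damping factor, the undirected treatment of edges, seeding the walk from the top-k hits, and the edge-type label `"about"` with its weight are all illustrative choices, not details taken from the paper.

```python
import math
from collections import defaultdict


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv) if nu and nv else 0.0


def personalized_pagerank(edges, edge_weight, seeds, damping=0.85, iters=50):
    """Power-iteration PPR. Each edge's share of the walk is scaled by its type weight."""
    if not seeds:
        return {}
    out = defaultdict(list)
    nodes = set(seeds)
    for src, dst, kind in edges:
        w = edge_weight.get(kind, 1.0)
        out[src].append((dst, w))
        out[dst].append((src, w))  # assumption: the walk ignores edge direction
        nodes.update((src, dst))
    total = sum(seeds.values())
    if total <= 0:  # all seeds scored zero: fall back to a uniform restart
        seeds = {n: 1.0 for n in seeds}
        total = float(len(seeds))
    restart = {n: seeds.get(n, 0.0) / total for n in nodes}
    rank = dict(restart)
    for _ in range(iters):
        nxt = {n: (1 - damping) * restart[n] for n in nodes}
        for n in nodes:
            total_w = sum(w for _, w in out[n])
            if total_w == 0:
                # Dangling node: send its mass back to the restart distribution.
                for m in nodes:
                    nxt[m] += damping * rank[n] * restart[m]
                continue
            for m, w in out[n]:
                nxt[m] += damping * rank[n] * w / total_w
        rank = nxt
    return rank


def retrieve(query_vec, embeddings, edges, edge_weight, k=2, alpha=1.0, beta=1.0):
    """Rank nodes by alpha * cosine + beta * PPR, seeding the walk from the top-k hits."""
    sims = {n: cosine(query_vec, v) for n, v in embeddings.items()}
    top_k = sorted(sims, key=sims.get, reverse=True)[:k]
    seeds = {n: max(sims[n], 0.0) for n in top_k}
    ppr = personalized_pagerank(edges, edge_weight, seeds)
    scores = {n: alpha * sims.get(n, 0.0) + beta * ppr.get(n, 0.0)
              for n in set(sims) | set(ppr)}
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


# Toy data: fact_b has zero cosine similarity but shares a concept with fact_a.
embeddings = {"fact_a": [1.0, 0.0], "fact_b": [0.0, 1.0], "fact_c": [0.6, 0.8]}
edges = [("fact_a", "concept_x", "about"), ("fact_b", "concept_x", "about")]
ranked = retrieve([1.0, 0.0], embeddings, edges, {"about": 1.0}, k=1)
print(ranked[0][0])  # fact_a
```

In this toy case `fact_b` receives a nonzero score purely through the shared concept node, which is the bridging behavior the core claim attributes to concept nodes.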

What carries the argument

The concept-mediated knowledge graph whose concept nodes supply cross-cutting traversal paths among episode, fact, and reflection nodes.
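One way to picture this structure is a small typed graph in which concept nodes act as the middle hop between otherwise unconnected memories. The sketch below is an assumption-laden illustration: the class `MemoryGraph`, its methods, and the edge label `"about"` are hypothetical, and the edge kind is kept as a free-form string because the five edge types are not enumerated in the text reviewed here.

```python
from dataclasses import dataclass, field
from enum import Enum


class NodeType(Enum):
    # The four node types named in the abstract.
    EPISODE = "episode"
    FACT = "fact"
    REFLECTION = "reflection"
    CONCEPT = "concept"


@dataclass
class MemoryGraph:
    node_type: dict = field(default_factory=dict)  # node id -> NodeType
    neighbors: dict = field(default_factory=dict)  # node id -> set of (neighbor id, edge kind)

    def add_node(self, node_id, ntype):
        self.node_type[node_id] = ntype
        self.neighbors.setdefault(node_id, set())

    def add_edge(self, a, b, kind):
        # `kind` is a placeholder label; the real edge taxonomy is not given in the source.
        self.neighbors[a].add((b, kind))
        self.neighbors[b].add((a, kind))

    def bridged_via_concepts(self, node_id):
        """Nodes exactly two hops away whose middle hop is a concept node."""
        reached = set()
        for mid, _ in self.neighbors[node_id]:
            if self.node_type[mid] is NodeType.CONCEPT:
                reached.update(n for n, _ in self.neighbors[mid] if n != node_id)
        return reached


g = MemoryGraph()
g.add_node("ep1", NodeType.EPISODE)
g.add_node("ep2", NodeType.EPISODE)
g.add_node("fact1", NodeType.FACT)
g.add_node("marathon_training", NodeType.CONCEPT)
for node in ("ep1", "ep2", "fact1"):
    g.add_edge(node, "marathon_training", "about")

print(sorted(g.bridged_via_concepts("fact1")))  # ['ep1', 'ep2']
```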

If this is right

Performance advantages increase monotonically with dialogue length on MemoryArena tasks.
GAAMA matches the strongest method in every category while every baseline degrades in at least one.
The same graph supports both low-level fact recall and higher-level reflection reuse without separate stores.
GRAFT can surgically repair retrieval failures by augmenting facts or topology after an initial query.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same node-and-edge design could be applied to non-conversational agents that must track evolving world state over long horizons.
Because advantages widen with length, the method may reduce reliance on ever-larger context windows in the underlying language model.
Testing whether the monotonic length advantage continues beyond the current benchmark lengths would directly probe the scalability claim.
Replacing the LLM extraction step with a symbolic parser could isolate how much of the gain comes from the graph topology versus the quality of the extracted nodes.

Load-bearing premise

The LLM extraction of atomic facts, topic concepts, and reflections from raw episodes accurately preserves structural relationships without introducing errors that would degrade later retrieval.

What would settle it

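An ablation that removes all concept nodes and their bridging edges, then re-runs retrieval on the LoCoMo-10 benchmark. If the scores stay close to the full system's 79.1%, the concept nodes are not what carries the gain and the central claim is falsified; if they fall to or below the tuned RAG baseline, the claim is supported.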
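The graph surgery such an ablation needs is small. The sketch below assumes a hypothetical representation (a dict of node id to type string, plus a list of `(source, target, kind)` edges) and covers only the removal step; re-running retrieval and scoring against the benchmark are out of scope here.

```python
def ablate_concepts(node_type, edges):
    """Return a copy of the graph without concept nodes or any edge touching one."""
    kept_nodes = {n: t for n, t in node_type.items() if t != "concept"}
    kept_edges = [(a, b, kind) for a, b, kind in edges
                  if a in kept_nodes and b in kept_nodes]
    return kept_nodes, kept_edges


# Toy graph: the edge label "link" is a placeholder, not one of the paper's edge types.
node_type = {"ep1": "episode", "fact1": "fact", "camping_trip": "concept"}
edges = [
    ("ep1", "fact1", "link"),
    ("fact1", "camping_trip", "link"),
    ("ep1", "camping_trip", "link"),
]
nodes_after, edges_after = ablate_concepts(node_type, edges)
print(edges_after)  # [('ep1', 'fact1', 'link')]
```

Keeping the facts, reflections, and episodes untouched matters: it holds the LLM-extracted content fixed so that any score change can be attributed to the missing topology.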

Original abstract

AI agents that interact with users across multiple sessions require persistent long-term memory to maintain coherent, personalized behavior. Current approaches either rely on flat retrieval-augmented generation (RAG), which loses structural relationships among memories, or use entity-centric knowledge graphs that suffer from mega-hub effects in conversational data, diluting graph-based relevance propagation. We propose GAAMA, a graph-augmented associative memory for agents that constructs a concept-mediated knowledge graph through a three-step pipeline: (1) verbatim episode preservation, (2) LLM-based extraction of atomic facts and topic-level concept nodes, and (3) synthesis of higher-order reflections. The resulting graph uses four node types (episode, fact, reflection, concept) connected by five structural edge types, with concept nodes providing cross-cutting traversal paths that avoid the mega-hub problem of entity-centric designs. Retrieval combines cosine-similarity-based k-nearest neighbor search with edge-type-aware Personalized PageRank (PPR) through an additive scoring function. We further introduce GRAFT (Graph Repair by Augmenting Facts & Topology), a post-retrieval corrective layer that diagnoses retrieval failures and surgically repairs the knowledge graph. On LoCoMo-10 (1,540 questions, 10 multi-session conversations), GAAMA achieves 79.1% mean reward, a +4.2 pp improvement over a tuned RAG baseline, the strongest comparator. On MemoryArena, GAAMA outperforms full-context baselines across three tasks: Group Travel (+0.4 pp), Web Shopping (+3.4 pp), and Progressive Search (+0.7 pp), with advantages growing monotonically with dialogue length. Notably, GAAMA delivers consistent performance across all categories, matching the best competing method in each, whereas every competitor degrades in at least one category.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

GAAMA's four-node graph with concept nodes and GRAFT repair shows modest benchmark gains over RAG, but the source of those gains is unclear without extraction ablations.

GAAMA builds a graph memory for multi-session agents using four node types (episodes kept verbatim, atomic facts, higher-order reflections, and topic-level concepts) linked by five edge types. Concept nodes are meant to create cross-cutting paths that reduce the mega-hub concentration common in entity graphs. Retrieval blends cosine kNN with edge-aware Personalized PageRank, and a post-retrieval GRAFT layer diagnoses failures and patches the graph. On LoCoMo-10 the system reaches 79.1% mean reward, 4.2 percentage points above a tuned RAG baseline, and on MemoryArena it beats full-context baselines with the margin widening as dialogues lengthen.

The design is explicit enough that an engineer could reimplement the pipeline from the description. The consistent performance across task categories is a practical plus. The main gap is the lack of any measured accuracy for the LLM extraction of facts, concepts, and reflections, and no ablation that holds the LLM prompts fixed while removing the graph structure. Without those checks it is hard to tell whether the reported lift comes from the topology or simply from better-organized prompting. Error bars and statistical tests are also absent.

This is aimed at teams shipping production agents that need reliable memory across sessions rather than at theorists. Practitioners who want a concrete architecture to test will get value from the node choices and the repair step. The work is coherent on its own terms and grounded in real benchmarks, so it deserves a serious referee who can ask for the missing extraction metrics and ablations.

Referee Report

2 major / 1 minor

Summary. The manuscript introduces GAAMA, a graph-augmented associative memory for AI agents in multi-session settings. It builds a four-node-type concept-mediated knowledge graph (episodes, facts, reflections, concepts) via verbatim preservation, LLM extraction of atomic facts and topic-level concepts, and synthesis of higher-order reflections. Retrieval combines cosine kNN with edge-type-aware Personalized PageRank; a post-retrieval GRAFT layer diagnoses failures and repairs the graph. On LoCoMo-10 (1,540 questions across 10 conversations) GAAMA reports 79.1% mean reward (+4.2 pp over tuned RAG). On MemoryArena it shows small gains over full-context baselines that increase monotonically with dialogue length and remain consistent across task categories.

Significance. If the gains prove robustly attributable to the concept-mediated topology and PPR rather than unablated LLM components, GAAMA would provide a concrete mechanism for scalable long-term agent memory that avoids entity-centric mega-hub dilution. The reported monotonic scaling with dialogue length and cross-category consistency are potentially valuable empirical signals for long-horizon interaction.

major comments (2)

[Abstract] Abstract / LoCoMo-10 results: the headline 79.1% mean reward and +4.2 pp improvement are stated without error bars, number of runs, standard deviation, or any statistical test. This omission directly affects confidence in whether the margin over the tuned RAG baseline is reliable or could arise from variance or unreported tuning.
[Methods] Methods (three-step pipeline): the central claim that concept nodes and PPR traversal produce the observed advantage rests on the assumption that LLM extraction of facts, concepts, and reflections faithfully preserves structure. No extraction precision/recall figures, inter-annotator agreement, or ablation that removes concept nodes or reflections while keeping the same facts is reported, leaving open the possibility that gains derive from prompting quality rather than graph topology.
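The statistical check requested in the first major comment can be sketched with the standard library alone. The function below computes a paired t statistic over matched scores; the pairing unit (for example one score per conversation, for the same conversation under both systems) and the toy numbers are assumptions made for illustration, not results from the paper.

```python
import math
from statistics import mean, stdev


def paired_t(scores_a, scores_b):
    """Paired t statistic and degrees of freedom for two matched score lists."""
    if len(scores_a) != len(scores_b) or len(scores_a) < 2:
        raise ValueError("need two equal-length lists with at least two pairs")
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    sd = stdev(diffs)  # sample standard deviation of the paired differences
    if sd == 0:
        raise ValueError("all paired differences are identical; t is undefined")
    t_stat = mean(diffs) / (sd / math.sqrt(len(diffs)))
    return t_stat, len(diffs) - 1


# Hypothetical per-unit scores, for illustration only.
system = [3.0, 5.0, 7.0]
baseline = [1.0, 2.0, 3.0]
t_stat, dof = paired_t(system, baseline)
print(round(t_stat, 3), dof)  # 5.196 2
```

Turning the statistic into a p-value needs the t distribution, which the standard library does not provide; `scipy.stats.ttest_rel` and `scipy.stats.wilcoxon` are the usual off-the-shelf options if SciPy is available.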

minor comments (1)

[Abstract] The five structural edge types are referenced but never enumerated; an explicit list with definitions would improve reproducibility.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive report. We address each major comment below and will revise the manuscript accordingly to strengthen the empirical claims.

Referee: [Abstract] Abstract / LoCoMo-10 results: the headline 79.1% mean reward and +4.2 pp improvement are stated without error bars, number of runs, standard deviation, or any statistical test. This omission directly affects confidence in whether the margin over the tuned RAG baseline is reliable or could arise from variance or unreported tuning.

Authors: We agree that statistical rigor is essential for interpreting the reported gains. In the revised manuscript we will report results aggregated over five independent runs, include standard deviations and error bars on all LoCoMo-10 metrics, and add a paired t-test (or Wilcoxon signed-rank test) comparing GAAMA against the tuned RAG baseline to test whether the +4.2 pp margin is statistically significant. revision: yes

Referee: [Methods] Methods (three-step pipeline): the central claim that concept nodes and PPR traversal produce the observed advantage rests on the assumption that LLM extraction of facts, concepts, and reflections faithfully preserves structure. No extraction precision/recall figures, inter-annotator agreement, or ablation that removes concept nodes or reflections while keeping the same facts is reported, leaving open the possibility that gains derive from prompting quality rather than graph topology.

Authors: We acknowledge the absence of direct extraction-quality metrics and topology ablations in the original submission. In the revision we will (1) report precision/recall for fact and concept extraction against a human-annotated subset of 200 episodes, (2) include inter-annotator agreement (Cohen’s kappa) for the concept labeling step, and (3) add an ablation that retains the identical facts and reflections but removes all concept nodes and their incident edges, thereby isolating the contribution of the concept-mediated topology and PPR traversal from prompting effects. revision: yes
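The two extraction-quality measures promised in this response are simple to compute once annotations exist. The sketch below assumes exact string matching between extracted and gold items, which is a simplification (free-text facts would likely need normalization or semantic matching), and the toy labels are invented for illustration.

```python
from collections import Counter


def precision_recall(predicted, gold):
    """Set-based precision and recall under exact matching (an assumption)."""
    predicted, gold = set(predicted), set(gold)
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    return precision, recall


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    observed = sum(x == y for x, y in zip(labels_a, labels_b)) / n
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label frequencies.
    expected = sum(count_a[lab] * count_b[lab] for lab in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical concept labels, for illustration only.
extracted = {"camping_trip", "charity_run", "beach_outing"}
annotated = {"charity_run", "beach_outing", "family_vacation", "marathon_training"}
print(precision_recall(extracted, annotated))

annotator_1 = ["x", "x", "y", "y"]
annotator_2 = ["x", "y", "y", "y"]
print(cohens_kappa(annotator_1, annotator_2))  # 0.5
```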

Circularity Check

0 steps flagged

No circularity: empirical architecture evaluated on external benchmarks

The paper describes a three-step pipeline for building a concept-mediated knowledge graph (verbatim episodes, LLM fact/concept extraction, reflection synthesis) and retrieval via kNN + edge-aware PPR plus GRAFT repair. Performance is reported as measured rewards on LoCoMo-10 (79.1%, +4.2 pp over tuned RAG) and MemoryArena tasks, with advantages scaling by dialogue length. No equations, fitted parameters, or predictions are defined inside the paper that are then re-used as outputs; the gains are external empirical comparisons. The LLM extraction step is a design choice whose fidelity is not quantified, but this is a correctness/robustness issue rather than a circular reduction of any claimed derivation to its own inputs. The result is self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review supplies no explicit free parameters, axioms, or invented entities; the central claim rests on the unstated assumption that LLM extraction produces reliable facts and concepts.

pith-pipeline@v0.9.0 · 5635 in / 1190 out tokens · 36604 ms · 2026-05-14T21:05:32.860871+00:00 · methodology

Reference graph

Works this paper leans on

18 extracted references · 18 canonical work pages

[1] Each fact must be a **single, specific, atomic claim** (e.g., "User’s birthday is March 15, 1990")

[2] **Do NOT duplicate existing facts.** If an existing fact already captures the information, skip it

[3] **Resolve relative dates to absolute dates** using the conversation timestamp. For example, if the conversation date is "2023-06-15" and the user says "last week", resolve to approximately "2023-06-08"

[4] Derive general knowledge from episodes by doing multi-step reasoning where possible

[5] Do not extract events or interactions as facts. Only extract general knowledge, preferences, attributes, or relationships that can be applied broadly

[6] Each fact should stand alone without requiring the original conversation for context

[7] For each fact, list which concept(s) it relates to (from the concepts you extract below). ## Part 2: Concepts ### Rules

[8] Concepts are short topic labels (2-5 words, snake_case) representing activities, events, topics, or themes

[9] **Good concepts**: camping_trip, adoption_process, beach_outing, charity_run, art_expression, career_transition, family_vacation, marathon_training

[10] **Do NOT use**: Person names, generic words (e.g., NOT family, life, experience, conversation, sharing), adjectives (e.g., NOT beautiful, amazing), dates

[11] **Reuse existing concepts** when applicable. Only create new concepts when no existing one fits

[12] Each new episode should have 1-3 concepts

[13] Each concept must be linked to the episode IDs it appears in. ## Output format (JSON only, no markdown fences) Return a single JSON object: {"facts": [ {"fact_text": "Melanie painted a lake sunrise in 2022", "belief": 0.95, "source_episode_ids": ["ep-abc123", "ep-def456"], "concepts": ["artistic_creation", "painting_hobby"]} ], "concepts": [ {"concept_...

[14] Each reflection should synthesize information from multiple facts when possible

[15] **Do NOT duplicate existing reflections.** If an existing reflection already captures the insight, skip it

[16] Reflections should be actionable or informative -- they should help in future interactions

[17] Each reflection should stand alone without requiring the original facts for context

[18] Only generate reflections when there is genuine insight to be drawn. It is perfectly fine to return zero reflections. ## Output format (JSON only, no markdown fences) Return a single JSON object: {"reflections": [ {"reflection_text": "User consistently prefers minimalist tools and configurations across all development environments", "belief": 0.8, "source...
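Two of the rules listed above are mechanical enough to check in code: the date resolution in entry [3] and the concept label format in entries [8] and [9]. The sketch below is a hedged illustration; the function names are hypothetical, the "last week means seven days earlier" reading follows the example in entry [3], and the regular expression is an assumed interpretation of "snake_case" that also rejects digits as a crude stand-in for the ban on dates. Person names and adjectives (entry [10]) cannot be detected this way.

```python
import re
from datetime import date, timedelta

# Assumed reading of "snake_case": lowercase ASCII words joined by single underscores.
SNAKE_CASE = re.compile(r"[a-z]+(?:_[a-z]+)*")


def resolve_last_week(conversation_date: str) -> str:
    """Resolve "last week" to an absolute ISO date, seven days before the conversation."""
    day = date.fromisoformat(conversation_date)
    return (day - timedelta(weeks=1)).isoformat()


def concept_label_ok(label: str) -> bool:
    """Check the format rule only: snake_case with 2 to 5 words."""
    if not SNAKE_CASE.fullmatch(label):
        return False
    return 2 <= len(label.split("_")) <= 5


print(resolve_last_week("2023-06-15"))       # 2023-06-08
print(concept_label_ok("marathon_training"))  # True
print(concept_label_ok("family"))             # False: a single generic word
```

A check like this could sit between the LLM call and the graph write, so that malformed labels are rejected before they become concept nodes.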
