EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
Pith reviewed 2026-05-09 22:22 UTC · model grok-4.3
The pith
Engrama's graph-structured memory outperforms full-context prompting on cross-space reasoning in long-term conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Engrama, the graph-structured memory system, scores higher than full-context prompting on cross-space reasoning (0.6532 versus 0.6291) while trailing in the composite score (0.5367 versus 0.6186), demonstrating that graph retrieval enables better handling of information integration across sessions.
What carries the argument
Engrama, a graph-structured memory system for organizing and retrieving information from extended conversation histories.
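The review does not expose Engrama's internals, so the sketch below is a reading aid only: it shows the generic shape of graph-structured memory retrieval, the class of mechanism credited with the cross-space advantage. Facts extracted from different sessions share entity nodes, and retrieval is a bounded multi-hop traversal rather than a nearest-neighbor lookup. All identifiers (MemoryGraph, add_fact, retrieve) are hypothetical, not the paper's actual data model.

```python
from collections import defaultdict, deque

class MemoryGraph:
    """Illustrative graph memory: entity nodes linked by typed edges.

    Hypothetical sketch; the review does not specify Engrama's schema.
    """

    def __init__(self):
        # entity -> list of (relation, other_entity, session_id)
        self.facts = defaultdict(list)

    def add_fact(self, subj, rel, obj, session_id):
        self.facts[subj].append((rel, obj, session_id))
        self.facts[obj].append((f"inverse:{rel}", subj, session_id))

    def retrieve(self, seed_entities, max_hops=2):
        """Bounded multi-hop traversal: facts recorded in different
        sessions become jointly retrievable through shared entities."""
        seen = set(seed_entities)
        results = []
        frontier = deque((e, 0) for e in seed_entities)
        while frontier:
            entity, depth = frontier.popleft()
            if depth >= max_hops:
                continue
            for rel, other, session_id in self.facts[entity]:
                results.append((entity, rel, other, session_id))
                if other not in seen:
                    seen.add(other)
                    frontier.append((other, depth + 1))
        return results

# Facts learned in two different sessions are linked via a shared node:
g = MemoryGraph()
g.add_fact("user", "adopted", "dog Rex", session_id=3)
g.add_fact("dog Rex", "treated_by", "Dr. Lin", session_id=17)
print(g.retrieve(["user"]))  # reaches Dr. Lin via Rex, across sessions
```

The cross-session reachability in the example is precisely what vector retrieval tends to miss when the two facts are lexically dissimilar, which is one plausible reading of the cross-space result.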
Load-bearing premise
That the 150 queries and five categories accurately isolate memory architecture effects without confounding influences from query phrasing, persona design, or interactions with the shared GPT-4o answering model.
What would settle it
A replication of the evaluation with modified query phrasings or alternative personas in which Engrama's advantage on cross-space reasoning disappears.
Original abstract
Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.
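The abstract's control (one answering model, three memory backends) implies an evaluation harness in which only the context-construction step varies. A minimal sketch of that shape follows; every identifier (answer_with_gpt4o, judge, evaluate, the three build_context callables) is an assumption for illustration, not the paper's released code.

```python
from typing import Callable

def answer_with_gpt4o(question: str, context: str) -> str:
    """Stand-in for the shared GPT-4o answering call."""
    raise NotImplementedError  # would call the OpenAI API in a real run

def judge(prediction: str, reference: str) -> float:
    """Stand-in scorer; the paper's scoring rubric is not reproduced here."""
    return float(prediction.strip() == reference.strip())

def evaluate(build_context: Callable[[str], str], queries: list[dict]) -> float:
    """Score one memory architecture; build_context is the only moving part."""
    scores = []
    for q in queries:
        context = build_context(q["question"])            # memory system under test
        prediction = answer_with_gpt4o(q["question"], context)
        scores.append(judge(prediction, q["reference"]))
    return sum(scores) / len(scores)

# Three runs differing only in how context is assembled:
#   evaluate(full_history_context, queries)   # full-context prompting
#   evaluate(engrama_graph_context, queries)  # graph retrieval
#   evaluate(mem0_vector_context, queries)    # vector retrieval
```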
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EngramaBench, a benchmark for long-term conversational memory comprising five personas, 100 multi-session conversations, and 150 queries across five categories (factual recall, cross-space integration, temporal reasoning, adversarial abstention, emergent synthesis). It evaluates Engrama (graph-structured memory) against GPT-4o full-context prompting and Mem0 (vector-retrieval), all using the same GPT-4o answering model to isolate memory architecture effects. Results show full-context GPT-4o with the highest composite score (0.6186), Engrama at 0.5367 overall but outperforming on cross-space reasoning (0.6532 vs. 0.6291, n=30), Mem0 at 0.4809, and ablations indicating component trade-offs between cross-space gains and global performance.
Significance. If the benchmark isolates architecture effects cleanly, the cross-space result would demonstrate a concrete advantage for structured graph retrieval over full context or vector methods on integration tasks, while the ablations usefully expose systems-level tensions in memory design. The paper strengthens its empirical contribution by holding the answering model fixed across systems, reporting specific numerical scores, and including ablations; these elements make the findings more falsifiable and reproducible than typical memory evaluations.
major comments (2)
- [Abstract and §4 (Experiments)] The central claim that Engrama's graph structure produces the only cross-space advantage (0.6532 vs. 0.6291, n=30) is load-bearing, yet no information is supplied on how the 30 cross-space queries were generated, validated for neutrality, or checked against phrasing/persona confounds that could interact with the shared GPT-4o answering model rather than the retrieval architecture.
- [§3 (Benchmark Construction)] The five query categories are presented as cleanly isolating memory effects, but the manuscript supplies no details on the query-generation protocol, inter-rater reliability for category assignment, session-order controls, or statistical tests confirming that differences are not driven by query design or GPT-4o interaction patterns; this directly undermines attribution of the 0.0241 cross-space gap to graph structure (a significance check of the kind needed is sketched after this list).
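If per-query scores for the n=30 cross-space subset were released, the attribution worry above could be partially quantified with a paired permutation test on the 0.0241 gap. A minimal sketch, assuming hypothetical per-query scores (the review reports only the means 0.6532 and 0.6291):

```python
import random

def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on mean(a) - mean(b).

    Under the null, each query's (a_i, b_i) pair is exchangeable,
    so we randomly flip the sign of each per-query difference.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= abs(observed):
            hits += 1
    return observed, hits / n_perm

# Placeholder per-query scores for the 30 cross-space queries
# (illustrative only; real values are not published in the review):
rng = random.Random(1)
engrama = [rng.random() for _ in range(30)]
full_ctx = [rng.random() for _ in range(30)]
gap, p = paired_permutation_test(engrama, full_ctx)
print(f"mean gap = {gap:.4f}, permutation p = {p:.3f}")
```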
minor comments (2)
- [Results tables] In Table 1 or the equivalent results table, report standard deviations or bootstrap confidence intervals alongside the composite and per-category scores, especially for the n=30 cross-space subset (a bootstrap sketch follows this list).
- [§3] Clarify whether the 100 conversations were generated with fixed session lengths or variable lengths, and whether any filtering was applied before query creation.
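For the first minor comment, a percentile bootstrap over per-query scores is the simplest way to attach uncertainty to a category mean. A minimal sketch with made-up scores, since the review publishes only aggregates:

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score.

    Resamples per-query scores with replacement; especially relevant
    for the n=30 cross-space subset, where the mean is noisy.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-query scores; the review reports only category means.
scores = [0.7, 0.5, 0.9, 0.6, 0.8, 0.4, 0.65, 0.7, 0.55, 0.75] * 3  # n = 30
print(bootstrap_ci(scores))
```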
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the need for greater transparency in query construction. We agree that the current manuscript does not supply sufficient methodological detail on how the cross-space queries and category assignments were produced. Below we respond to each major comment and commit to a major revision that adds the requested information.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The central claim that Engrama's graph structure produces the only cross-space advantage (0.6532 vs. 0.6291, n=30) is load-bearing, yet no information is supplied on how the 30 cross-space queries were generated, validated for neutrality, or checked against phrasing/persona confounds that could interact with the shared GPT-4o answering model rather than the retrieval architecture.
  Authors: We acknowledge that the manuscript currently provides no explicit account of the generation or validation process for the 30 cross-space queries. This omission weakens the ability to attribute the observed 0.0241 gap solely to the retrieval architecture. In the revised version we will insert a dedicated paragraph (or subsection) in §4 that describes: (1) the protocol used to generate the queries, (2) the steps taken to ensure neutrality and to minimize phrasing or persona confounds, and (3) any manual or automated checks performed to verify that the queries do not preferentially interact with GPT-4o's internal knowledge rather than the supplied memory. We will also report the exact number of queries per category and any balancing procedures applied. Revision: yes.
- Referee: [§3 (Benchmark Construction)] The five query categories are presented as cleanly isolating memory effects, but the manuscript supplies no details on query-generation protocol, inter-rater reliability for category assignment, session-order controls, or statistical tests confirming that differences are not driven by query design or GPT-4o interaction patterns; this directly undermines attribution of the 0.0241 cross-space gap to graph structure.
  Authors: We agree that the absence of these details limits the strength of the causal claim. The current §3 describes the five categories at a high level but does not document the generation protocol, inter-rater reliability, session-order controls, or any statistical checks for query-design confounds. In the revision we will expand §3 with: (a) the full query-generation protocol, (b) the procedure and results of an inter-rater reliability assessment for category assignment (a sketch of such a check appears after this exchange), (c) explicit session-order controls, and (d) post-hoc statistical tests or sensitivity analyses evaluating whether query phrasing or GPT-4o interaction patterns could explain the cross-space result. These additions will directly address the referee's concern about attribution. Revision: yes.
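Of the items the authors commit to adding, inter-rater reliability for category assignment is the most mechanical to specify: with two annotators labeling each query into one of the five categories, Cohen's kappa gives chance-corrected agreement. A minimal sketch with invented labels (a real check would cover all 150 queries):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators assigning each query to one of
    the five EngramaBench categories: chance-corrected agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up labels for six queries, for illustration only.
a = ["factual", "cross-space", "temporal", "abstention", "synthesis", "factual"]
b = ["factual", "cross-space", "temporal", "abstention", "factual", "factual"]
print(f"kappa = {cohens_kappa(a, b):.3f}")  # 0.778 on this toy input
```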
Circularity Check
No circularity: purely empirical benchmark with direct measurements
Full rationale
The paper presents EngramaBench as an empirical evaluation of three memory systems on 150 queries across five categories, reporting raw performance scores (e.g., cross-space 0.6532 vs. 0.6291) without any derivations, equations, fitted parameters, or self-referential claims. All results are direct measurements from GPT-4o-based runs, isolating architecture effects via controlled comparisons. No load-bearing steps reduce to inputs by construction, self-citations, or ansatzes; the evaluation is a self-contained set of direct measurements that does not lean on external benchmarks for its central numbers.
Axiom & Free-Parameter Ledger
invented entities (1)
- Engrama graph-structured memory system (no independent evidence)
Reference graph
Works this paper leans on
- [10] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2405.14831
- [11] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. https://arxiv.org/abs/2504.19413
- [12] Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. 2024. PerLTQA: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing. https://arxiv.org/abs/2402.16288
- [13] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva N. Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130. https://arxiv.org/abs/2404.16130
- [14] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2405.14831
- [15] Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. 2024. A human-inspired reading agent with gist memory of very long contexts. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 23865-23886. https://arxiv.org/abs/2402.09727
- [16] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.org/abs/2402.17753
- [17] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560. https://arxiv.org/abs/2310.08560
- [18] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2025. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2410.10813
- [19]
- [20]