EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval
Pith reviewed 2026-05-09 22:22 UTC · model grok-4.3
The pith
Engrama's graph-structured memory outperforms full-context prompting on cross-space reasoning in long-term conversations.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Engrama, the graph-structured memory system, scores higher than full-context prompting on cross-space reasoning (0.6532 versus 0.6291) while trailing in the composite score (0.5367 versus 0.6186), demonstrating that graph retrieval enables better handling of information integration across sessions.
What carries the argument
Engrama, a graph-structured memory system for organizing and retrieving information from extended conversation histories.
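The review does not expose Engrama's internals, so the sketch below is a reading aid only: it shows the generic shape of graph-structured memory retrieval, the class of mechanism credited with the cross-space advantage. Facts extracted from different sessions share entity nodes, and retrieval is a bounded multi-hop traversal rather than a nearest-neighbor lookup. All identifiers (MemoryGraph, add_fact, retrieve) are hypothetical, not the paper's actual data model.

```python
from collections import defaultdict, deque

class MemoryGraph:
    """Illustrative graph memory: entity nodes linked by typed edges.

    Hypothetical sketch; the review does not specify Engrama's schema.
    """

    def __init__(self):
        # entity -> list of (relation, other_entity, session_id)
        self.facts = defaultdict(list)

    def add_fact(self, subj, rel, obj, session_id):
        self.facts[subj].append((rel, obj, session_id))
        self.facts[obj].append((f"inverse:{rel}", subj, session_id))

    def retrieve(self, seed_entities, max_hops=2):
        """Bounded multi-hop traversal: facts recorded in different
        sessions become jointly retrievable through shared entities."""
        seen = set(seed_entities)
        results = []
        frontier = deque((e, 0) for e in seed_entities)
        while frontier:
            entity, depth = frontier.popleft()
            if depth >= max_hops:
                continue
            for rel, other, session_id in self.facts[entity]:
                results.append((entity, rel, other, session_id))
                if other not in seen:
                    seen.add(other)
                    frontier.append((other, depth + 1))
        return results

# Facts learned in two different sessions are linked via a shared node:
g = MemoryGraph()
g.add_fact("user", "adopted", "dog Rex", session_id=3)
g.add_fact("dog Rex", "treated_by", "Dr. Lin", session_id=17)
print(g.retrieve(["user"]))  # reaches Dr. Lin via Rex, across sessions
```

The cross-session reachability in the example is precisely what vector retrieval tends to miss when the two facts are lexically dissimilar, which is one plausible reading of the cross-space result.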
Load-bearing premise
That the 150 queries and five categories accurately isolate memory architecture effects without confounding influences from query phrasing, persona design, or interactions with the shared GPT-4o answering model.
What would settle it
A replication of the evaluation with modified query phrasings or alternative personas in which Engrama's advantage on cross-space reasoning disappears.
Original abstract
Large language model assistants are increasingly expected to retain and reason over information accumulated across many sessions. We introduce EngramaBench, a benchmark for long-term conversational memory built around five personas, one hundred multi-session conversations, and one hundred fifty queries spanning factual recall, cross-space integration, temporal reasoning, adversarial abstention, and emergent synthesis. We evaluate Engrama, a graph-structured memory system, against GPT-4o full-context prompting and Mem0, an open-source vector-retrieval memory system. All three use the same answering model (GPT-4o), isolating the effect of memory architecture. GPT-4o full-context achieves the highest composite score (0.6186), while Engrama scores 0.5367 globally but is the only system to score higher than full-context prompting on cross-space reasoning (0.6532 vs. 0.6291, n=30). Mem0 is cheapest but substantially weaker (0.4809). Ablations reveal that the components driving Engrama's cross-space advantage trade off against global composite score, exposing a systems-level tension between structured memory specialization and aggregate optimization.
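The abstract's control (one answering model, three memory backends) implies an evaluation harness in which only the context-construction step varies. A minimal sketch of that shape follows; every identifier (answer_with_gpt4o, judge, evaluate, the three build_context callables) is an assumption for illustration, not the paper's released code.

```python
from typing import Callable

def answer_with_gpt4o(question: str, context: str) -> str:
    """Stand-in for the shared GPT-4o answering call."""
    raise NotImplementedError  # would call the OpenAI API in a real run

def judge(prediction: str, reference: str) -> float:
    """Stand-in scorer; the paper's scoring rubric is not reproduced here."""
    return float(prediction.strip() == reference.strip())

def evaluate(build_context: Callable[[str], str], queries: list[dict]) -> float:
    """Score one memory architecture; build_context is the only moving part."""
    scores = []
    for q in queries:
        context = build_context(q["question"])            # memory system under test
        prediction = answer_with_gpt4o(q["question"], context)
        scores.append(judge(prediction, q["reference"]))
    return sum(scores) / len(scores)

# Three runs differing only in how context is assembled:
#   evaluate(full_history_context, queries)   # full-context prompting
#   evaluate(engrama_graph_context, queries)  # graph retrieval
#   evaluate(mem0_vector_context, queries)    # vector retrieval
```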
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces EngramaBench, a benchmark for long-term conversational memory comprising five personas, 100 multi-session conversations, and 150 queries across five categories (factual recall, cross-space integration, temporal reasoning, adversarial abstention, emergent synthesis). It evaluates Engrama (graph-structured memory) against GPT-4o full-context prompting and Mem0 (vector-retrieval), all using the same GPT-4o answering model to isolate memory architecture effects. Results show full-context GPT-4o with the highest composite score (0.6186), Engrama at 0.5367 overall but outperforming on cross-space reasoning (0.6532 vs. 0.6291, n=30), Mem0 at 0.4809, and ablations indicating component trade-offs between cross-space gains and global performance.
Significance. If the benchmark isolates architecture effects cleanly, the cross-space result would demonstrate a concrete advantage for structured graph retrieval over full context or vector methods on integration tasks, while the ablations usefully expose systems-level tensions in memory design. The paper strengthens its empirical contribution by holding the answering model fixed across systems, reporting specific numerical scores, and including ablations; these elements make the findings more falsifiable and reproducible than typical memory evaluations.
major comments (2)
- [Abstract and §4 (Experiments)] The central claim that Engrama's graph structure produces the only cross-space advantage (0.6532 vs. 0.6291, n=30) is load-bearing, yet no information is supplied on how the 30 cross-space queries were generated, validated for neutrality, or checked against phrasing/persona confounds that could interact with the shared GPT-4o answering model rather than the retrieval architecture.
- [§3 (Benchmark Construction)] The five query categories are presented as cleanly isolating memory effects, but the manuscript supplies no details on the query-generation protocol, inter-rater reliability for category assignment, session-order controls, or statistical tests confirming that differences are not driven by query design or GPT-4o interaction patterns; this directly undermines attribution of the 0.0241 cross-space gap to graph structure (a significance check of the kind needed is sketched after this list).
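If per-query scores for the n=30 cross-space subset were released, the attribution worry above could be partially quantified with a paired permutation test on the 0.0241 gap. A minimal sketch, assuming hypothetical per-query scores (the review reports only the means 0.6532 and 0.6291):

```python
import random

def paired_permutation_test(a, b, n_perm=10_000, seed=0):
    """Two-sided paired permutation test on mean(a) - mean(b).

    Under the null, each query's (a_i, b_i) pair is exchangeable,
    so we randomly flip the sign of each per-query difference.
    """
    rng = random.Random(seed)
    diffs = [x - y for x, y in zip(a, b)]
    observed = sum(diffs) / len(diffs)
    hits = 0
    for _ in range(n_perm):
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(sum(flipped) / len(flipped)) >= abs(observed):
            hits += 1
    return observed, hits / n_perm

# Placeholder per-query scores for the 30 cross-space queries
# (illustrative only; real values are not published in the review):
rng = random.Random(1)
engrama = [rng.random() for _ in range(30)]
full_ctx = [rng.random() for _ in range(30)]
gap, p = paired_permutation_test(engrama, full_ctx)
print(f"mean gap = {gap:.4f}, permutation p = {p:.3f}")
```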
minor comments (2)
- [Results tables] In Table 1 or the equivalent results table, report standard deviations or bootstrap confidence intervals alongside the composite and per-category scores, especially for the n=30 cross-space subset (a bootstrap sketch follows this list).
- [§3] Clarify whether the 100 conversations were generated with fixed session lengths or variable lengths, and whether any filtering was applied before query creation.
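For the first minor comment, a percentile bootstrap over per-query scores is the simplest way to attach uncertainty to a category mean. A minimal sketch with made-up scores, since the review publishes only aggregates:

```python
import random

def bootstrap_ci(scores, n_boot=10_000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for a mean score.

    Resamples per-query scores with replacement; especially relevant
    for the n=30 cross-space subset, where the mean is noisy.
    """
    rng = random.Random(seed)
    n = len(scores)
    means = sorted(
        sum(rng.choice(scores) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# Hypothetical per-query scores; the review reports only category means.
scores = [0.7, 0.5, 0.9, 0.6, 0.8, 0.4, 0.65, 0.7, 0.55, 0.75] * 3  # n = 30
print(bootstrap_ci(scores))
```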
Simulated Author's Rebuttal
We thank the referee for the careful reading and for highlighting the need for greater transparency in query construction. We agree that the current manuscript does not supply sufficient methodological detail on how the cross-space queries and category assignments were produced. Below we respond to each major comment and commit to a major revision that adds the requested information.
Point-by-point responses
- Referee: [Abstract and §4 (Experiments)] The central claim that Engrama's graph structure produces the only cross-space advantage (0.6532 vs. 0.6291, n=30) is load-bearing, yet no information is supplied on how the 30 cross-space queries were generated, validated for neutrality, or checked against phrasing/persona confounds that could interact with the shared GPT-4o answering model rather than the retrieval architecture.
  Authors: We acknowledge that the manuscript currently provides no explicit account of the generation or validation process for the 30 cross-space queries. This omission weakens the ability to attribute the observed 0.0241 gap solely to the retrieval architecture. In the revised version we will insert a dedicated paragraph (or subsection) in §4 that describes: (1) the protocol used to generate the queries, (2) the steps taken to ensure neutrality and to minimize phrasing or persona confounds, and (3) any manual or automated checks performed to verify that the queries do not preferentially interact with GPT-4o's internal knowledge rather than the supplied memory. We will also report the exact number of queries per category and any balancing procedures applied. Revision: yes.
- Referee: [§3 (Benchmark Construction)] The five query categories are presented as cleanly isolating memory effects, but the manuscript supplies no details on query-generation protocol, inter-rater reliability for category assignment, session-order controls, or statistical tests confirming that differences are not driven by query design or GPT-4o interaction patterns; this directly undermines attribution of the 0.0241 cross-space gap to graph structure.
  Authors: We agree that the absence of these details limits the strength of the causal claim. The current §3 describes the five categories at a high level but does not document the generation protocol, inter-rater reliability, session-order controls, or any statistical checks for query-design confounds. In the revision we will expand §3 with: (a) the full query-generation protocol, (b) the procedure and results of an inter-rater reliability assessment for category assignment (a sketch of such a check appears after this exchange), (c) explicit session-order controls, and (d) post-hoc statistical tests or sensitivity analyses evaluating whether query phrasing or GPT-4o interaction patterns could explain the cross-space result. These additions will directly address the referee's concern about attribution. Revision: yes.
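Of the items the authors commit to adding, inter-rater reliability for category assignment is the most mechanical to specify: with two annotators labeling each query into one of the five categories, Cohen's kappa gives chance-corrected agreement. A minimal sketch with invented labels (a real check would cover all 150 queries):

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators assigning each query to one of
    the five EngramaBench categories: chance-corrected agreement."""
    n = len(labels_a)
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Made-up labels for six queries, for illustration only.
a = ["factual", "cross-space", "temporal", "abstention", "synthesis", "factual"]
b = ["factual", "cross-space", "temporal", "abstention", "factual", "factual"]
print(f"kappa = {cohens_kappa(a, b):.3f}")  # 0.778 on this toy input
```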
Circularity Check
No circularity: purely empirical benchmark with direct measurements
Full rationale
The paper presents EngramaBench as an empirical evaluation of three memory systems on 150 queries across five categories, reporting raw performance scores (e.g., cross-space 0.6532 vs. 0.6291) without any derivations, equations, fitted parameters, or self-referential claims. All results are direct measurements from GPT-4o-based runs, isolating architecture effects via controlled comparisons. No load-bearing steps reduce to inputs by construction, self-citations, or ansatzes; the evaluation is a self-contained set of direct measurements that does not lean on external benchmarks for its central numbers.
Axiom & Free-Parameter Ledger
invented entities (1)
- Engrama graph-structured memory system (no independent evidence)
Reference graph
Works this paper leans on
- [10] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2405.14831
- [11] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413. https://arxiv.org/abs/2504.19413
- [12] Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. 2024. PerLTQA: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing. https://arxiv.org/abs/2402.16288
- [13] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva N. Mody, Steven Truitt, and Jonathan Larson. 2024. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130. https://arxiv.org/abs/2404.16130
- [14] Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. 2024. HippoRAG: Neurobiologically inspired long-term memory for large language models. In Proceedings of the 38th Conference on Neural Information Processing Systems (NeurIPS). https://arxiv.org/abs/2405.14831
- [15] Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. 2024. A human-inspired reading agent with gist memory of very long contexts. In Proceedings of the 41st International Conference on Machine Learning (ICML), pages 23865-23886. https://arxiv.org/abs/2402.09727
- [16] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (ACL). https://arxiv.org/abs/2402.17753
- [17] Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. 2023. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560. https://arxiv.org/abs/2310.08560
- [18] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2025. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In Proceedings of the International Conference on Learning Representations (ICLR). https://arxiv.org/abs/2410.10813
- [19]
- [20]