WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation
Pith reviewed 2026-05-10 04:34 UTC · model grok-4.3
The pith
WorldDB uses recursive world nodes and write-time edge programs to reach 96.40% overall accuracy on the LongMemEval-s long-conversation memory benchmark.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
WorldDB is a vector graph-of-worlds memory engine built on three commitments. First, every node is a world: a container with its own interior subgraph, ontology scope, and composed embedding, recursive to arbitrary depth. Second, nodes are content-addressed and immutable, so any edit produces a new hash at the node and at every ancestor. Third, edges are write-time programs: each edge type ships on_insert, on_delete, and on_query_rewrite handlers, which implement supersession by closing validity, contradiction by preserving both sides, and same_as by staging a merge proposal.
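The second commitment (content addressing with hash changes propagating to every ancestor) can be illustrated with a minimal sketch. The `World` class, `digest` property, and `replace_child` helper below are hypothetical names invented for this illustration, not the paper's API, and the sketch leaves out ontology scopes and composed embeddings entirely.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class World:
    """Hypothetical immutable world node: a payload plus child worlds."""
    content: str
    children: Tuple["World", ...] = ()

    @property
    def digest(self) -> str:
        # The digest covers the node's own content and its children's digests,
        # so a change anywhere below a node changes that node's digest too.
        payload = json.dumps(
            {"content": self.content, "children": [c.digest for c in self.children]},
            sort_keys=True,
        )
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def replace_child(parent: World, index: int, new_child: World) -> World:
    """Build a new parent with one child swapped; the old parent is untouched."""
    kids = list(parent.children)
    kids[index] = new_child
    return World(parent.content, tuple(kids))


# Illustrative data only.
leaf = World("user lives in Berlin")
sibling = World("user likes tea")
session = World("session 1", (leaf, sibling))
root = World("agent memory", (session,))

# "Editing" the leaf means building new versions of every ancestor.
edited_session = replace_child(session, 0, World("user lives in Munich"))
edited_root = replace_child(root, 0, edited_session)
```

In this sketch the original `root` still resolves to the Berlin fact, the edited root has a different digest, and the untouched sibling keeps its digest, which is the property a Merkle-style audit trail relies on.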
What carries the argument
The recursive world node as a container holding its own subgraph and embedding, paired with edge types that execute write-time handlers for ontology-aware reconciliation instead of simple labels.
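To make "edges as write-time handlers" concrete, here is a rough sketch of how an insert path might dispatch to per-type handlers. Everything in it is an assumption made for illustration: the `Store` layout, the handler signatures, and the choice to keep validity intervals in a side table are not taken from the paper, which does not publish its handler code. Only the three behaviors (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal) come from the source.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Optional, Tuple


@dataclass
class Store:
    # node id -> (valid_from, valid_to); valid_to is None while the fact is current
    validity: Dict[str, Tuple[int, Optional[int]]] = field(default_factory=dict)
    contradictions: List[Tuple[str, str]] = field(default_factory=list)
    merge_proposals: List[Tuple[str, str]] = field(default_factory=list)


def on_insert_supersession(store: Store, src: str, dst: str) -> None:
    # src supersedes dst: close dst's validity at the point src becomes valid.
    new_from, _ = store.validity[src]
    old_from, _ = store.validity[dst]
    store.validity[dst] = (old_from, new_from)


def on_insert_contradicts(store: Store, src: str, dst: str) -> None:
    # Both sides are preserved; the disagreement itself is what gets recorded.
    store.contradictions.append((src, dst))


def on_insert_same_as(store: Store, src: str, dst: str) -> None:
    # No eager merge: a proposal is staged for later resolution.
    store.merge_proposals.append((src, dst))


ON_INSERT: Dict[str, Callable[[Store, str, str], None]] = {
    "supersession": on_insert_supersession,
    "contradicts": on_insert_contradicts,
    "same_as": on_insert_same_as,
}


def add_edge(store: Store, edge_type: str, src: str, dst: str) -> None:
    # Every edge passes through its handler; unknown types are rejected,
    # which is one way to read "no raw append path exists".
    handler = ON_INSERT.get(edge_type)
    if handler is None:
        raise ValueError(f"no handler registered for edge type {edge_type!r}")
    handler(store, src, dst)


# Illustrative data only.
store = Store()
store.validity["city=Berlin"] = (1, None)
store.validity["city=Munich"] = (5, None)
add_edge(store, "supersession", "city=Munich", "city=Berlin")
add_edge(store, "contradicts", "likes=tea", "likes=coffee")
add_edge(store, "same_as", "Bob", "Robert")
```

After these calls the Berlin fact is valid over (1, 5), the Munich fact stays open, and the contradiction and merge proposal are recorded without either node being altered.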
Load-bearing premise
That the write-time handlers for supersession, contradiction, and same_as can be defined and executed without introducing new inconsistencies or prohibitive latency in realistic multi-session agent workloads.
What would settle it
A multi-session conversational workload in which executing the edge handlers produces logical inconsistencies among stored facts or causes query latency to grow unacceptably with session count.
Original abstract
Persistent memory is the bottleneck separating stateless chatbots from long-running agentic systems. Retrieval-augmented generation (RAG) over flat vector stores fragments facts into chunks, loses cross-session identity, and has no first-class notion of supersession or contradiction. Recent bitemporal knowledge-graph systems (Graphiti, Memento, Hydra DB) add typed edges and valid-time metadata, but the graph itself remains flat: no recursive composition, no content-addressed invariants on nodes, and edge types carry no behavior beyond a label. We present WorldDB, a memory engine built on three commitments: (i) every node is a world -- a container with its own interior subgraph, ontology scope, and composed embedding, recursive to arbitrary depth; (ii) nodes are content-addressed and immutable, so any edit produces a new hash at the node and every ancestor, giving a Merkle-style audit trail for free; (iii) edges are write-time programs -- each edge type ships on_insert/on_delete/on_query_rewrite handlers (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal), so no raw append path exists. On LongMemEval-s (500 questions, ~115k-token conversational stacks), WorldDB with Claude Opus 4.7 as answerer achieves 96.40% overall / 97.11% task-averaged accuracy, a +5.61pp improvement over the previously reported Hydra DB state-of-the-art (90.79%) and +11.20pp over Supermemory (85.20%), with perfect single-session-assistant recall and robust performance on temporal reasoning (96.24%), knowledge update (98.72%), and preference synthesis (96.67%). Ablations show that the engine's graph layer -- resolver-unified entities and typed refers_to edges -- contributes +7.0pp task-averaged independently of the underlying answerer.
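The abstract reports two aggregate figures (overall and task-averaged accuracy) and two percentage-point gaps. The sketch below shows how the two aggregates differ and re-derives the quoted gaps from the quoted scores. The per-task counts in `toy` are made up purely to show the distinction; they are not the paper's per-task results, and the function names are hypothetical.

```python
from typing import Dict, Tuple


def overall_accuracy(per_task: Dict[str, Tuple[int, int]]) -> float:
    """Pooled accuracy: total correct over total questions."""
    correct = sum(c for c, _ in per_task.values())
    total = sum(n for _, n in per_task.values())
    return correct / total


def task_averaged_accuracy(per_task: Dict[str, Tuple[int, int]]) -> float:
    """Unweighted mean of per-task accuracies, so small tasks count equally."""
    return sum(c / n for c, n in per_task.values()) / len(per_task)


# Made-up (correct, total) counts, chosen only to show the two metrics diverge.
toy = {"task_a": (9, 10), "task_b": (45, 90)}
toy_overall = overall_accuracy(toy)              # 54 / 100
toy_task_avg = task_averaged_accuracy(toy)       # (0.9 + 0.5) / 2

# Scores quoted in the abstract, in percent.
worlddb, hydra_db, supermemory = 96.40, 90.79, 85.20
gap_vs_hydra = round(worlddb - hydra_db, 2)           # percentage points
gap_vs_supermemory = round(worlddb - supermemory, 2)  # percentage points
correct_of_500 = round(worlddb / 100 * 500)           # questions answered correctly
```

The quoted gaps are consistent with the quoted scores (5.61pp and 11.20pp), and 96.40% of 500 questions corresponds to 482 correct answers.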
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces WorldDB, a memory engine for long-running agentic systems that models memory as a recursive graph of worlds (nodes containing subgraphs, ontology scopes, and composed embeddings), enforces content-addressed immutability with Merkle-style audit trails on edits, and treats edges as programmable write-time handlers (on_insert/on_delete/on_query_rewrite) for supersession, contradiction, and same_as operations. On the LongMemEval-s benchmark (500 questions over ~115k-token stacks), it reports 96.40% overall and 97.11% task-averaged accuracy using Claude Opus 4.7, outperforming Hydra DB by 5.61pp and Supermemory by 11.20pp, with an ablation attributing +7.0pp independently to the graph layer.
Significance. If the handler correctness and implementation details hold, the work could meaningfully advance persistent memory architectures beyond flat RAG or bitemporal KGs by enabling recursive composition, built-in auditability, and behavior-carrying edges. The reported gains on temporal reasoning, knowledge updates, and preference synthesis tasks indicate potential practical impact for multi-session agents, though the absence of methodological details limits immediate assessment of generalizability.
major comments (2)
- [architecture description (abstract and §3)] The headline accuracy claims (96.40% overall, +7.0pp from graph layer) and ablation rest on the write-time handlers for supersession, contradiction, and same_as executing correctly without introducing merge errors, validity violations, or query artifacts on ~115k-token stacks. The architecture description states that edges ship on_insert/on_delete/on_query_rewrite programs with no raw append path, yet no pseudocode, invariants, worked examples of same_as merge staging, or Merkle-hash preservation under contradiction are supplied. This is load-bearing for the central empirical result.
- [evaluation and ablation sections] No implementation details, error bars, full benchmark methodology, or reproducibility artifacts (e.g., code, exact prompt templates, or handler test cases) are provided, making the soundness of the +5.61pp improvement over Hydra DB unverifiable from the text. The ablation isolating the graph layer is only interpretable if the handlers themselves are shown to be sound.
minor comments (2)
- The abstract and text lack discussion of latency or computational overhead introduced by the write-time programs and recursive embedding composition, which is relevant for realistic agent workloads.
- References to related work on content-addressed graphs or programmable edges (beyond Graphiti, Memento, and Hydra DB) are missing; adding them would strengthen the positioning.
Simulated Author's Rebuttal
We thank the referee for the thorough and constructive review. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.
Point-by-point responses
-
Referee: [architecture description (abstract and §3)] The headline accuracy claims (96.40% overall, +7.0pp from graph layer) and ablation rest on the write-time handlers for supersession, contradiction, and same_as executing correctly without introducing merge errors, validity violations, or query artifacts on ~115k-token stacks. The architecture description states that edges ship on_insert/on_delete/on_query_rewrite programs with no raw append path, yet no pseudocode, invariants, worked examples of same_as merge staging, or Merkle-hash preservation under contradiction are supplied. This is load-bearing for the central empirical result.
Authors: We agree that the absence of explicit handler specifications limits verification of the central claims. In the revised manuscript we will expand §3 to include pseudocode for the on_insert, on_delete, and on_query_rewrite handlers of the supersession, contradiction, and same_as edge types. We will also state the invariants that preserve Merkle hashes under these operations and provide a worked example of same_as merge staging, thereby demonstrating that no raw append path exists and that merge errors are prevented. revision: yes
-
Referee: [evaluation and ablation sections] No implementation details, error bars, full benchmark methodology, or reproducibility artifacts (e.g., code, exact prompt templates, or handler test cases) are provided, making the soundness of the +5.61pp improvement over Hydra DB unverifiable from the text. The ablation isolating the graph layer is only interpretable if the handlers themselves are shown to be sound.
Authors: We acknowledge that the current version omits these details. In the revision we will add error bars from repeated runs, a complete description of the LongMemEval-s evaluation protocol, the exact prompt templates used with Claude Opus 4.7, and unit test cases for each handler. We will also release the implementation and artifacts upon acceptance so that the reported gains and the +7.0pp graph-layer ablation can be independently verified. revision: yes
Circularity Check
No circularity: empirical benchmark results are independent of internal derivations
Full rationale
The paper describes a memory engine architecture with three design commitments (world nodes, content-addressed immutability, write-time edge handlers) and reports measured accuracy on the external LongMemEval-s benchmark (500 questions, ~115k-token stacks) using Claude Opus 4.7 as the answerer. These performance figures (96.40% overall, +5.61pp over Hydra DB, +7.0pp from graph layer) are obtained via direct experimental evaluation and ablation, not derived from equations, fitted parameters, or self-referential definitions within the paper. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the central claims; the architecture is presented as a set of engineering choices whose correctness is assessed externally against the benchmark rather than reduced to its own inputs by construction.
Axiom & Free-Parameter Ledger
axioms (1)
- [standard math] Content-addressed hashing yields immutable nodes and a Merkle-style audit trail for free
invented entities (2)
- World node: no independent evidence
- Write-time program edge: no independent evidence
Forward citations
Cited by 1 Pith paper
-
TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory
TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies whil...
Reference graph
Works this paper leans on
[1] Anthropic. Model Context Protocol Specification, 2025. Protocol version 2025-06-18.
[2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory, 2025.
[3] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '09), pages 758–759, Boston, MA, USA, 2009. ACM.
[4] DMR Benchmark Authors. Deep Memory Retrieval (DMR) Benchmark. Evaluation suite, 2024.
[5] Yuntao Du, Xinjun Zhu, Lu Chen, Baihua Zheng, and Yunjun Gao. HAKG: Hierarchy-Aware Knowledge Gated Network for Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR '22), pages 1390–1400. ACM, 2022.
[6] Shane Farkas. Memento: Bitemporal Knowledge Graph Memory for AI Agents. GitHub repository, 2025.
[7] Kelly Hong, Anton Troynikov, and Jeff Huber. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report, July 2025.
[8] Matthew A. Jaro. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.
[9] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020.
[10] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts, 2023.
[11] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A Temporal Knowledge Graph Architecture for Agent Memory, 2025.
[12] Soham Ratnaparkhi, Nishkarsh Srivastava, Aadil Garg, Pratham Garg, and Tejas Kumar. Hydra DB: Beyond Context Windows for Long-Term Agentic Memory. Technical report, 2026.
[13] Tony Rigoni. LLMs: Bigger Is Not Always Better. Ampere Computing Blog, 2024.
[14] Alieh Saeedi, Eric Peukert, and Erhard Rahm. Incremental Multi-source Entity Resolution for Knowledge Graph Completion. In Proceedings of the 17th Extended Semantic Web Conference (ESWC 2020), pages 393–408. Springer, 2020.
[15] Supermemory. State-of-the-Art Agent Memory on LongMemEval. Supermemory Research, 2026.
[16] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, 2024.

A Evaluation Protocol
All three prompts used in §7 are reproduced verbatim below so results can be replicated exactly. Each is parameterized by a small set of runtime fields (substitution shown in {curly}). T...
- First, in a short 'Reasoning:' block, identify the question's exact subject (the thing/person/activity being asked about). Check whether the retrieved turns actually mention that subject. If they only mention a similar-but-different thing (e.g. the question asks about 'football' but turns only mention 'baseball'), say so explicitly - the correct answer in that...
- List the dated turns that are relevant to the *actual* subject and note the arithmetic or ordering you need
- Then give a concise final answer under 'Answer:'
- **Supersession rule**: when multiple turns report different values for the same fact (e.g. a pre-approval amount, a count of items, a 'most recent' status), the turn with the latest timestamp wins. Earlier turns are historical, not current
- For 'how many ...' or 'when did I ...' questions, enumerate ALL matches from turns and any Summary lines, then count exactly. Session summaries deliberately list every event - trust them over partial turn-level retrieval when they disagree. Do not invent events and do not double-count
- For 'recommend / suggest / what would I prefer' style questions, ALWAYS synthesize a preference statement from the user's past statements - 'The user would prefer X'. Never say 'I don't know' if there's any hint of interest
- Only say 'I don't know' / 'The information is not available' when the question's subject genuinely isn't in the turns.

Retrieved turns: {turns_block}
Question: {question}
Reasoning:

Post-processing: parse the response, locate the last occurrence of "Answer:" (case-insensitive), and return everything after it as the final answer string. If no "Answer:" marker i...
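The post-processing step described above (take the text after the last case-insensitive "Answer:" marker) could look roughly like the following. The function name is hypothetical, stripping surrounding whitespace is an added assumption, and returning `None` when no marker is present is a placeholder: the source text describing the no-marker fallback is truncated, so the real behavior is not known here.

```python
import re
from typing import Optional

# Case-insensitive match for the literal marker described in the protocol.
_ANSWER_MARKER = re.compile(r"answer:", re.IGNORECASE)


def extract_final_answer(response: str) -> Optional[str]:
    """Return the text after the LAST 'Answer:' marker, or None if absent."""
    last_end = None
    for match in _ANSWER_MARKER.finditer(response):
        last_end = match.end()  # keep overwriting so the last match wins
    if last_end is None:
        # Assumption: the documented fallback is truncated in the source,
        # so this sketch simply signals "no marker found".
        return None
    # Assumption: surrounding whitespace is not part of the answer.
    return response[last_end:].strip()


example = "Reasoning: the turns mention two dates.\nAnswer: draft\nANSWER: March 22"
parsed = extract_final_answer(example)
missing = extract_final_answer("The model forgot the marker.")
```

Taking the last match rather than the first matters when the reasoning block itself contains the word "Answer:" before the final line.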
- Reference admits ignorance. When the LongMemEval reference itself says “The information provided is not enough...”, a generated "I don’t know" must score CORRECT. Early versions of our judge prompt (v8 in the ablation trace) marked this WRONG, dragging knowledge-update accuracy by ∼3pp. Rule three is the explicit fix
- Paraphrase tolerance. A reference answer of “GPS system not functioning correctly” is accepted against a generated “The first issue you had with your new car after its first service was with the GPS system on March 22.”
- Numeric strictness. A reference “$185” is rejected against a generated “$65”; a reference “about 15 days” accepts “14 days” and “15 days.” Rule four makes the difference between an approximation judged present in the reference and an arbitrary rounding by the model.

Why not an ensemble judge? We ran pilot comparisons with a GPT-4o judge in parallel on a 100...

- LLM latency and cost are incompatible with query-path SLOs. A 200ms answerer call multiplies by every hop of a multi-step query. At a 500ms P95 target there is no budget for it
- LLM non-determinism corrupts the ontology. The reconciler’s guarantees rely on handler code being a function of input state; an LLM call breaks that, and the failure mode is silent. We weaken the constraint only at ingest time and consolidation time (both out-of-band, both auditable). The engine’s Extractor trait and Summarizer trait each accept a caller-suppli...