WorldDB: A Vector Graph-of-Worlds Memory Engine with Ontology-Aware Write-Time Reconciliation

Harish Santhanalakshmi Ganesan

arxiv: 2604.18478 · v1 · submitted 2026-04-20 · 💻 cs.AI · cs.CL

Pith reviewed 2026-05-10 04:34 UTC · model grok-4.3

classification 💻 cs.AI cs.CL

keywords persistent memory · knowledge graphs · agent memory · vector embeddings · write-time reconciliation · recursive worlds · long-term memory · ontology-aware

The pith

WorldDB uses recursive world nodes and write-time edge programs to reach 96.4 percent accuracy on long conversational memory tests.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces WorldDB as a memory engine designed to overcome fragmentation and loss of identity in flat vector stores by structuring persistent knowledge as a graph where every node is itself a world containing an interior subgraph, its own ontology scope, and a composed embedding that recurses to arbitrary depth. It establishes that making these nodes content-addressed and immutable creates automatic Merkle-style audit trails on every edit, while treating edges as executable programs with on_insert, on_delete, and on_query_rewrite handlers enables ontology-aware reconciliation such as superseding old facts, preserving contradictions, or staging merges without raw append paths. A sympathetic reader would care because this architecture directly targets the core bottleneck separating stateless chatbots from reliable long-running agentic systems that must track updates, preferences, and temporal changes across hundreds of thousands of tokens.

Core claim

WorldDB is a vector graph-of-worlds memory engine built on three commitments: every node is a world container with its own interior subgraph, ontology scope, and composed embedding recursive to arbitrary depth; nodes are content-addressed and immutable so any edit produces a new hash at the node and every ancestor; edges are write-time programs where each type ships on_insert, on_delete, and on_query_rewrite handlers that implement supersession by closing validity, contradiction by preserving both sides, and same_as by staging merge proposals.
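The second commitment can be illustrated with a minimal sketch, assuming a toy `World` type whose hash covers its payload and the hashes of its interior children. The names `World`, `digest`, and `edit` are hypothetical and are not taken from the paper; the sketch only shows why an edit to one node changes the hash of that node and of every ancestor while untouched siblings keep theirs.

```python
import hashlib
import json
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class World:
    """Hypothetical immutable node: a payload plus its interior child worlds."""
    payload: str
    children: Tuple["World", ...] = ()

    @property
    def digest(self) -> str:
        # The hash covers the payload and every child digest, so a change
        # anywhere in the interior changes this digest as well.
        body = json.dumps(
            {"payload": self.payload, "children": [c.digest for c in self.children]},
            sort_keys=True,
        )
        return hashlib.sha256(body.encode("utf-8")).hexdigest()


def edit(root: World, path: Tuple[int, ...], new_payload: str) -> World:
    """Return a new root with the node at `path` rewritten; `root` is untouched."""
    if not path:
        return World(new_payload, root.children)
    i = path[0]
    new_child = edit(root.children[i], path[1:], new_payload)
    children = root.children[:i] + (new_child,) + root.children[i + 1:]
    return World(root.payload, children)


leaf_a = World("user lives in Berlin")
leaf_b = World("user likes jazz")
session = World("session 1", (leaf_a, leaf_b))
root = World("user memory", (session,))

# Rewrite the first leaf of the first session; old and new roots both remain.
new_root = edit(root, (0, 0), "user lives in Munich")
print(root.digest != new_root.digest)                       # root hash changed
print(new_root.children[0].children[1].digest == leaf_b.digest)  # sibling unchanged
```

Because the old root is never mutated, keeping the sequence of root digests is enough to reconstruct an audit trail in this toy model; how WorldDB stores or indexes those digests is not specified in the text above.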

What carries the argument

The recursive world node as a container holding its own subgraph and embedding, paired with edge types that execute write-time handlers for ontology-aware reconciliation instead of simple labels.
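To make the idea of edges as write-time programs concrete, here is a minimal sketch under stated assumptions: facts are flat key/value records with an integer validity interval, and only `on_insert` is modelled (`on_delete` and `on_query_rewrite` are omitted). The types `Fact` and `Store`, the edge-type names, and the dispatch table are hypothetical illustrations, not the paper's API.

```python
from dataclasses import dataclass, field, replace
from typing import Callable, Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Fact:
    key: str
    value: str
    valid_from: int
    valid_to: Optional[int] = None  # None means the fact is still open


@dataclass
class Store:
    facts: List[Fact] = field(default_factory=list)
    contradictions: List[Tuple[Fact, Fact]] = field(default_factory=list)
    merge_proposals: List[Tuple[Fact, Fact]] = field(default_factory=list)


def supersedes_on_insert(store: Store, old: Fact, new: Fact) -> None:
    # Close the old fact's validity at the moment the new one starts.
    i = store.facts.index(old)
    store.facts[i] = replace(old, valid_to=new.valid_from)
    store.facts.append(new)


def contradicts_on_insert(store: Store, old: Fact, new: Fact) -> None:
    # Keep both sides open and record the conflict instead of choosing.
    store.facts.append(new)
    store.contradictions.append((old, new))


def same_as_on_insert(store: Store, old: Fact, new: Fact) -> None:
    # Stage a proposal; nothing is merged until it is accepted elsewhere.
    store.facts.append(new)
    store.merge_proposals.append((old, new))


ON_INSERT: Dict[str, Callable[[Store, Fact, Fact], None]] = {
    "supersedes": supersedes_on_insert,
    "contradicts": contradicts_on_insert,
    "same_as": same_as_on_insert,
}


def write(store: Store, edge_type: str, old: Fact, new: Fact) -> None:
    # Every write is routed through a handler; an unknown edge type raises
    # KeyError, which stands in for "no raw append path" in this sketch.
    ON_INSERT[edge_type](store, old, new)


def current(store: Store, key: str) -> List[Fact]:
    return [f for f in store.facts if f.key == key and f.valid_to is None]


acme = Fact("employer", "Acme", valid_from=1)
globex = Fact("employer", "Globex", valid_from=5)
initech = Fact("employer", "Initech", valid_from=5)

store = Store(facts=[acme])  # seeded initial state for the example
write(store, "supersedes", acme, globex)
write(store, "contradicts", globex, initech)
print([f.value for f in current(store, "employer")])
```

In this toy model the superseded fact stays in the store with a closed interval, while the two contradicting facts both remain open and are linked in `contradictions`. Whether the real handlers behave this way under concurrent or multi-session writes is exactly the open question raised in this review.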

Load-bearing premise

That the write-time handlers for supersession, contradiction, and same_as can be defined and executed without introducing new inconsistencies or prohibitive latency in realistic multi-session agent workloads.

What would settle it

A multi-session conversational workload in which executing the edge handlers produces logical inconsistencies among stored facts or causes query latency to grow unacceptably with session count.

Persistent memory is the bottleneck separating stateless chatbots from long-running agentic systems. Retrieval-augmented generation (RAG) over flat vector stores fragments facts into chunks, loses cross-session identity, and has no first-class notion of supersession or contradiction. Recent bitemporal knowledge-graph systems (Graphiti, Memento, Hydra DB) add typed edges and valid-time metadata, but the graph itself remains flat: no recursive composition, no content-addressed invariants on nodes, and edge types carry no behavior beyond a label. We present WorldDB, a memory engine built on three commitments: (i) every node is a world -- a container with its own interior subgraph, ontology scope, and composed embedding, recursive to arbitrary depth; (ii) nodes are content-addressed and immutable, so any edit produces a new hash at the node and every ancestor, giving a Merkle-style audit trail for free; (iii) edges are write-time programs -- each edge type ships on_insert/on_delete/on_query_rewrite handlers (supersession closes validity, contradicts preserves both sides, same_as stages a merge proposal), so no raw append path exists. On LongMemEval-s (500 questions, ~115k-token conversational stacks), WorldDB with Claude Opus 4.7 as answerer achieves 96.40% overall / 97.11% task-averaged accuracy, a +5.61pp improvement over the previously reported Hydra DB state-of-the-art (90.79%) and +11.20pp over Supermemory (85.20%), with perfect single-session-assistant recall and robust performance on temporal reasoning (96.24%), knowledge update (98.72%), and preference synthesis (96.67%). Ablations show that the engine's graph layer -- resolver-unified entities and typed refers_to edges -- contributes +7.0pp task-averaged independently of the underlying answerer.
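The reported margins can be checked directly from the figures quoted in the abstract. The short calculation below is only an arithmetic sanity check; the conversion to a question count assumes that overall accuracy is a simple fraction of the 500 questions, which the abstract does not state explicitly.

```python
worlddb_overall = 96.40  # percent, from the abstract
baselines = {"Hydra DB": 90.79, "Supermemory": 85.20}  # percent, from the abstract

# Percentage-point gaps between WorldDB and each baseline.
deltas = {name: round(worlddb_overall - score, 2) for name, score in baselines.items()}
for name, delta in deltas.items():
    print(f"{name}: +{delta:.2f}pp")

# Assumption: overall accuracy = correct / 500.
questions = 500
correct = round(worlddb_overall / 100 * questions)
print(f"{correct} correct, {questions - correct} missed")
```

Under that assumption the headline figure corresponds to 482 correct answers and 18 misses, which gives a sense of how few questions separate the compared systems.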

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

WorldDB adds recursive nested worlds, Merkle versioning, and programmable write-time edge handlers to agent memory graphs, with clear benchmark gains, but the handler logic lacks the details needed to verify it stays consistent.

WorldDB's core idea is to replace flat vector or bitemporal graphs with recursive world nodes that contain their own subgraphs, use content-addressed immutable storage with automatic ancestor hashing, and attach executable handlers to every edge type. No raw appends are allowed; everything routes through on_insert, on_delete, or on_query_rewrite programs that handle supersession, contradiction, and same_as merges. On LongMemEval-s the system reaches 96.4% overall accuracy with Claude Opus 4.7, a 5.61-point lift over Hydra DB, and the ablation isolates a 7-point contribution from the graph layer alone. That combination of structure and measured improvement is the actual advance over the cited prior work.

Referee Report

2 major / 2 minor

Summary. The paper introduces WorldDB, a memory engine for long-running agentic systems that models memory as a recursive graph of worlds (nodes containing subgraphs, ontology scopes, and composed embeddings), enforces content-addressed immutability with Merkle-style audit trails on edits, and treats edges as programmable write-time handlers (on_insert/on_delete/on_query_rewrite) for supersession, contradiction, and same_as operations. On the LongMemEval-s benchmark (500 questions over ~115k-token stacks), it reports 96.40% overall and 97.11% task-averaged accuracy using Claude Opus 4.7, outperforming Hydra DB by 5.61pp and Supermemory by 11.20pp, with an ablation attributing +7.0pp independently to the graph layer.

Significance. If the handler correctness and implementation details hold, the work could meaningfully advance persistent memory architectures beyond flat RAG or bitemporal KGs by enabling recursive composition, built-in auditability, and behavior-carrying edges. The reported gains on temporal reasoning, knowledge updates, and preference synthesis tasks indicate potential practical impact for multi-session agents, though the absence of methodological details limits immediate assessment of generalizability.

major comments (2)

[architecture description (abstract and §3)] The headline accuracy claims (96.40% overall, +7.0pp from graph layer) and ablation rest on the write-time handlers for supersession, contradiction, and same_as executing correctly without introducing merge errors, validity violations, or query artifacts on ~115k-token stacks. The architecture description states that edges ship on_insert/on_delete/on_query_rewrite programs with no raw append path, yet no pseudocode, invariants, worked examples of same_as merge staging, or Merkle-hash preservation under contradiction are supplied. This is load-bearing for the central empirical result.
[evaluation and ablation sections] No implementation details, error bars, full benchmark methodology, or reproducibility artifacts (e.g., code, exact prompt templates, or handler test cases) are provided, making the soundness of the +5.61pp improvement over Hydra DB unverifiable from the text. The ablation isolating the graph layer is only interpretable if the handlers themselves are shown to be sound.

minor comments (2)

The abstract and text lack discussion of latency or computational overhead introduced by the write-time programs and recursive embedding composition, which is relevant for realistic agent workloads.
Missing references to related work on content-addressed graphs or programmable edges (e.g., beyond Graphiti/Memento/Hydra DB) would strengthen the positioning.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the thorough and constructive review. We address each major comment below and will revise the manuscript accordingly to improve clarity and reproducibility.

Referee: [architecture description (abstract and §3)] The headline accuracy claims (96.40% overall, +7.0pp from graph layer) and ablation rest on the write-time handlers for supersession, contradiction, and same_as executing correctly without introducing merge errors, validity violations, or query artifacts on ~115k-token stacks. The architecture description states that edges ship on_insert/on_delete/on_query_rewrite programs with no raw append path, yet no pseudocode, invariants, worked examples of same_as merge staging, or Merkle-hash preservation under contradiction are supplied. This is load-bearing for the central empirical result.

Authors: We agree that the absence of explicit handler specifications limits verification of the central claims. In the revised manuscript we will expand §3 to include pseudocode for the on_insert, on_delete, and on_query_rewrite handlers of the supersession, contradiction, and same_as edge types. We will also state the invariants that preserve Merkle hashes under these operations and provide a worked example of same_as merge staging, thereby demonstrating that no raw append path exists and that merge errors are prevented.

Revision: yes

Referee: [evaluation and ablation sections] No implementation details, error bars, full benchmark methodology, or reproducibility artifacts (e.g., code, exact prompt templates, or handler test cases) are provided, making the soundness of the +5.61pp improvement over Hydra DB unverifiable from the text. The ablation isolating the graph layer is only interpretable if the handlers themselves are shown to be sound.

Authors: We acknowledge that the current version omits these details. In the revision we will add error bars from repeated runs, a complete description of the LongMemEval-s evaluation protocol, the exact prompt templates used with Claude Opus 4.7, and unit test cases for each handler. We will also release the implementation and artifacts upon acceptance so that the reported gains and the +7.0pp graph-layer ablation can be independently verified.

Revision: yes

Circularity Check

0 steps flagged

No circularity: empirical benchmark results are independent of internal derivations

The paper describes a memory engine architecture with three design commitments (world nodes, content-addressed immutability, write-time edge handlers) and reports measured accuracy on the external LongMemEval-s benchmark (500 questions, ~115k-token stacks) using Claude Opus 4.7 as the answerer. These performance figures (96.40% overall, +5.61pp over Hydra DB, +7.0pp from graph layer) are obtained via direct experimental evaluation and ablation, not derived from equations, fitted parameters, or self-referential definitions within the paper. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the central claims; the architecture is presented as a set of engineering choices whose correctness is assessed externally against the benchmark rather than reduced to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 2 invented entities

The design rests on standard cryptographic hashing for immutability and on the assumption that ontology scopes and handler semantics can be consistently defined; no free parameters are fitted in the abstract, and the new entities are the world abstraction and programmable edges.

axioms (1)

Standard math: content-addressed hashing yields immutable nodes and a Merkle-style audit trail for free.
Invoked in the second commitment; relies on well-known properties of hash trees.

invented entities (2)

World node (no independent evidence)
purpose: Recursive container holding its own subgraph, ontology scope, and composed embedding
Core new abstraction stated in commitment (i); no independent evidence supplied beyond the system description.

Write-time program edge (no independent evidence)
purpose: Edge type that ships on_insert/on_delete/on_query_rewrite handlers to enforce supersession, contradiction, and merge logic
Core new abstraction stated in commitment (iii); no independent evidence supplied beyond the system description.

pith-pipeline@v0.9.0 · 5655 in / 1541 out tokens · 41569 ms · 2026-05-10T04:34:52.001876+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

TOKI: A Bitemporal Operator Algebra for Contradiction Resolution in LLM-Agent Persistent Memory
cs.DB 2026-06 unverdicted novelty 7.0

TOKI types four common contradiction-resolution heuristics as bitemporal operators on a dual-row schema, supplies soundness theorems, and shows via a verdict matrix that it alone avoids three write-time anomalies whil...

Reference graph

Works this paper leans on

28 extracted references · 28 canonical work pages · cited by 1 Pith paper

[1] Anthropic. Model Context Protocol Specification, 2025. Protocol version 2025-06-18.

[2] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory, 2025.

[3] Gordon V. Cormack, Charles L. A. Clarke, and Stefan Büttcher. Reciprocal Rank Fusion Outperforms Condorcet and Individual Rank Learning Methods. In Proceedings of the 32nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’09), pages 758–759, Boston, MA, USA, 2009. ACM.

[4] DMR Benchmark Authors. Deep Memory Retrieval (DMR) Benchmark. Evaluation suite, 2024.

[5] Yuntao Du, Xinjun Zhu, Lu Chen, Baihua Zheng, and Yunjun Gao. HAKG: Hierarchy-Aware Knowledge Gated Network for Recommendation. In Proceedings of the 45th International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’22), pages 1390–1400. ACM, 2022.

[6] Shane Farkas. Memento: Bitemporal Knowledge Graph Memory for AI Agents. GitHub repository, 2025.

[7] Kelly Hong, Anton Troynikov, and Jeff Huber. Context Rot: How Increasing Input Tokens Impacts LLM Performance. Chroma Technical Report, July 2025.

[8] Matthew A. Jaro. Advances in Record-Linkage Methodology as Applied to Matching the 1985 Census of Tampa, Florida. Journal of the American Statistical Association, 84(406):414–420, 1989.

[9] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, Sebastian Riedel, and Douwe Kiela. Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks, 2020.

[10] Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the Middle: How Language Models Use Long Contexts, 2023.

[11] Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A Temporal Knowledge Graph Architecture for Agent Memory, 2025.

[12] Soham Ratnaparkhi, Nishkarsh Srivastava, Aadil Garg, Pratham Garg, and Tejas Kumar. Hydra DB: Beyond Context Windows for Long-Term Agentic Memory. Technical report, 2026.

[13] Tony Rigoni. LLMs: Bigger Is Not Always Better. Ampere Computing Blog, 2024.

[14] Alieh Saeedi, Eric Peukert, and Erhard Rahm. Incremental Multi-source Entity Resolution for Knowledge Graph Completion. In Proceedings of the 17th Extended Semantic Web Conference (ESWC 2020), pages 393–408. Springer, 2020.

[15] Supermemory. State-of-the-Art Agent Memory on LongMemEval. Supermemory Research, 2026.

[16] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, 2024.
Appendix A, Evaluation Protocol: All three prompts used in §7 are reproduced verbatim below so results can be replicated exactly. Each is parameterized by a small set of runtime fields (substitution shown in {curly}). T...
[17] First, in a short 'Reasoning:' block, identify the question's exact subject (the thing/person/activity being asked about). Check whether the retrieved turns actually mention that subject. If they only mention a similar-but-different thing (e.g. the question asks about 'football' but turns only mention 'baseball'), say so explicitly - the correct answer in that...

[18] List the dated turns that are relevant to the *actual* subject and note the arithmetic or ordering you need

[19] Then give a concise final answer under 'Answer:'

[20] **Supersession rule**: when multiple turns report different values for the same fact (e.g. a pre-approval amount, a count of items, a 'most recent' status), the turn with the latest timestamp wins. Earlier turns are historical, not current

[21] For 'how many ...' or 'when did I ...' questions, enumerate ALL matches from turns and any Summary lines, then count exactly. Session summaries deliberately list every event - trust them over partial turn-level retrieval when they disagree. Do not invent events and do not double-count

[22] For 'recommend / suggest / what would I prefer' style questions, ALWAYS synthesize a preference statement from the user's past statements - 'The user would prefer X'. Never say 'I don't know' if there's any hint of interest

[23] Only say 'I don't know' / 'The information is not available' when the question's subject genuinely isn't in the turns. Retrieved turns: {turns_block} Question: {question} Reasoning: Post-processing: parse the response, locate the last occurrence of "Answer:" (case-insensitive), and return everything after it as the final answer string. If no "Answer:" marker i...

[24] Reference admits ignorance. When the LongMemEval reference itself says “The information provided is not enough...”, a generated "I don’t know" must score CORRECT. Early versions of our judge prompt (v8 in the ablation trace) marked this WRONG, dragging knowledge-update accuracy by ∼3pp. Rule three is the explicit fix

[25] Paraphrase tolerance. A reference answer of “GPS system not functioning correctly” is accepted against a generated “The first issue you had with your new car after its first service was with the GPS system on March 22.”

[26] Numeric strictness. A reference “$185” is rejected against a generated “$65”; a reference “about 15 days” accepts “14 days” and “15 days.” Rule four makes the difference between an approximation judged present in the reference and an arbitrary rounding by the model. Why not an ensemble judge? We ran pilot comparisons with a GPT-4o judge in parallel on a 100...

[27] LLM latency and cost are incompatible with query-path SLOs. A 200ms answerer call multiplies by every hop of a multi-step query. At a 500ms P95 target there is no budget for it

[28] LLM non-determinism corrupts the ontology. The reconciler’s guarantees rely on handler code being a function of input state; an LLM call breaks that, and the failure mode is silent. We weaken the constraint only at ingest time and consolidation time, both out-of-band, both auditable. The engine’s Extractor trait and Summarizer trait each accept a caller-suppli...
