PersonalAI: A Systematic Comparison of Knowledge Graph Storage and Retrieval Approaches for Personalized LLM agents

Dmitry Evseev; Evgeny Burnaev; Ilia Perepechkin; Mikhail Menschikov; Nikita Semenov; Petr Anokhin; Ruslan Kostoev; Victoria Dochkina

arxiv: 2506.17001 · v6 · submitted 2025-06-20 · 💻 cs.CL · cs.IR

Mikhail Menschikov, Dmitry Evseev, Victoria Dochkina, Ruslan Kostoev, Ilia Perepechkin, Petr Anokhin, Nikita Semenov, Evgeny Burnaev

Pith reviewed 2026-05-19 08:20 UTC · model grok-4.3

keywords knowledge graph · personalized LLM · external memory · hybrid graph · retrieval mechanisms · question answering · temporal reasoning · RAG

The pith

A hybrid knowledge graph with standard edges and two types of hyper-edges lets LLMs automatically build and query personalized memory that adapts retrieval to each task.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes an external memory framework for personalizing LLMs through a knowledge graph that the model itself constructs and updates from raw interactions. It builds on prior graph architectures by adding a hybrid design that mixes ordinary edges with two varieties of hyper-edges to capture both semantic relations and temporal sequences in one structure. The system pairs this graph with several retrieval strategies, including A* search, WaterCircles traversal, beam search, and combinations of them, so the same memory store can be queried differently depending on the dataset or model size. Tests on TriviaQA, HotpotQA, and a modified DiaASQ benchmark that now includes time stamps and contradictions show that no single memory-retrieval pairing wins on every task; instead, the best choice shifts with the demands of factual recall, multi-hop reasoning, or temporal consistency.

Core claim

We propose a flexible external memory framework based on a knowledge graph that is constructed and updated automatically by the LLM. Building upon the AriGraph architecture, we introduce a novel hybrid graph design that supports both standard edges and two types of hyper-edges, enabling rich and dynamic semantic and temporal representations. Our framework also supports diverse retrieval mechanisms, including A*, WaterCircles traversal, beam search, and hybrid methods, making it adaptable to different datasets and LLM capacities. We evaluate our system on the TriviaQA, HotpotQA, and DiaASQ benchmarks and demonstrate that the best-performing memory and retrieval configuration differs from one benchmark to the next.

What carries the argument

The hybrid graph that combines standard edges with two types of hyper-edges for simultaneous semantic and temporal representation, paired with interchangeable retrieval algorithms such as A* and WaterCircles traversal.

If this is right

No single memory-retrieval combination is best for every benchmark; performance peaks when the graph type and search method are matched to the reasoning demands of the task.
Adding temporal annotations and internal contradictions to a benchmark does not break the system; the hybrid edges continue to support context-aware reasoning.
The same stored graph can serve multiple downstream question types simply by swapping the retrieval algorithm rather than rebuilding the memory store.
Long-term interaction histories become usable for personalization without requiring the base LLM to keep every detail in its fixed context window.
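
As a hedged illustration of the third point, the sketch below queries one toy graph with two interchangeable retrievers: a breadth-first expansion and a generic beam search. These are textbook versions, not the paper's A*, WaterCircles, or beam search implementations; the graph contents, function names, and keyword scoring function are all assumptions made for the example.

```python
from collections import deque
from typing import Callable, Dict, List, Tuple

# Toy memory store: node -> list of (relation, neighbor). All names are hypothetical.
Graph = Dict[str, List[Tuple[str, str]]]
Triple = Tuple[str, str, str]
Retriever = Callable[[Graph, str], List[Triple]]


def bfs_retrieve(graph: Graph, start: str, max_depth: int = 2) -> List[Triple]:
    """Collect every triple reachable within max_depth hops of the start node."""
    seen = {start}
    queue = deque([(start, 0)])
    triples: List[Triple] = []
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes at the depth limit
        for relation, neighbor in graph.get(node, []):
            triples.append((node, relation, neighbor))
            if neighbor not in seen:
                seen.add(neighbor)
                queue.append((neighbor, depth + 1))
    return triples


def make_beam_retriever(score: Callable[[Triple], float],
                        width: int = 1, steps: int = 2) -> Retriever:
    """Build a beam-search retriever that keeps the `width` best paths per step."""
    def retrieve(graph: Graph, start: str) -> List[Triple]:
        beams: List[Tuple[float, List[Triple], str]] = [(0.0, [], start)]
        for _ in range(steps):
            candidates = []
            for total, path, node in beams:
                for relation, neighbor in graph.get(node, []):
                    triple = (node, relation, neighbor)
                    candidates.append((total + score(triple), path + [triple], neighbor))
            if not candidates:
                break  # no outgoing edges left to follow
            candidates.sort(key=lambda c: c[0], reverse=True)
            beams = candidates[:width]
        return beams[0][1]  # triples along the highest-scoring path
    return retrieve


memory: Graph = {
    "alice": [("owns", "phone_x"), ("lives_in", "paris")],
    "phone_x": [("has_issue", "battery")],
    "paris": [("located_in", "france")],
}


def keyword_score(triple: Triple) -> float:
    """Hypothetical relevance score: favor relations related to a device question."""
    return 1.0 if triple[1] in {"owns", "has_issue"} else 0.0


beam = make_beam_retriever(keyword_score)
print(len(bfs_retrieve(memory, "alice")))  # broad context: all triples within 2 hops
print(beam(memory, "alice"))               # narrow context: one scored path
```

The stored `memory` is never rebuilt; only the function passed as the retriever changes, which is the property the bullet describes.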

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

If hyper-edges reliably encode time order, the design could reduce drift in multi-turn conversations that span days or weeks.
The automatic update process might transfer to domains such as personal health tracking or financial advice where facts arrive incrementally and sometimes conflict.
Replacing benchmark dialogues with genuine user logs would test whether error accumulation remains tolerable outside curated test sets.

Load-bearing premise

The LLM can build and maintain the knowledge graph from ordinary user exchanges accurately enough that structural mistakes do not compound and impair later retrieval.

What would settle it

A measurable decline in answer accuracy on the extended DiaASQ set once the number of contradictory statements or interaction turns exceeds a modest threshold, or an increase in graph-construction errors tracked across successive sessions.

Original abstract

Personalizing language models by effectively incorporating user interaction history remains a central challenge in the development of adaptive AI systems. While large language models (LLMs), combined with Retrieval-Augmented Generation (RAG), have improved factual accuracy, they often lack structured memory and fail to scale in complex, long-term interactions. To address this, we propose a flexible external memory framework based on a knowledge graph that is constructed and updated automatically by the LLM. Building upon the AriGraph architecture, we introduce a novel hybrid graph design that supports both standard edges and two types of hyper-edges, enabling rich and dynamic semantic and temporal representations. Our framework also supports diverse retrieval mechanisms, including A*, WaterCircles traversal, beam search, and hybrid methods, making it adaptable to different datasets and LLM capacities. We evaluate our system on TriviaQA, HotpotQA, DiaASQ benchmarks and demonstrate that different memory and retrieval configurations yield optimal performance depending on the task. Additionally, we extend the DiaASQ benchmark with temporal annotations and internally contradictory statements, showing that our system remains robust and effective in managing temporal dependencies and context-aware reasoning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces PersonalAI, a flexible external memory framework for personalized LLM agents based on an automatically LLM-constructed hybrid knowledge graph. Building on AriGraph, it adds support for standard edges plus two types of hyper-edges to capture semantic and temporal relations, along with multiple retrieval methods (A*, WaterCircles traversal, beam search, and hybrids). The system is evaluated on TriviaQA, HotpotQA, and DiaASQ, claiming that different memory/retrieval configurations are optimal per task; an extended version of DiaASQ with temporal annotations and contradictory statements is used to demonstrate robustness to temporal dependencies and context-aware reasoning over contradictions.

Significance. If the empirical comparisons hold after adding direct validation of graph construction quality, the work could usefully advance structured memory designs for long-horizon LLM agents by showing how hybrid hyper-edge representations and task-adaptive retrieval interact. The systematic comparison across retrieval strategies on multiple benchmarks is a positive contribution that could inform practical choices in agent memory systems.

major comments (2)

[Evaluation] Evaluation section (including results on extended DiaASQ): the manuscript reports only end-task accuracy under different retrieval configurations but supplies no quantitative fidelity metrics for the LLM-driven graph construction and update process (e.g., edge precision/recall against ground truth, hyper-edge consistency, or structural drift after repeated updates). This is load-bearing for the central claim that the hybrid graph design (rather than prompting alone) confers robustness to temporal dependencies and contradictions.
[Methodology] Methodology / hybrid graph description: the two types of hyper-edges are introduced as enabling 'rich and dynamic semantic and temporal representations,' yet no formal definition, update rules, or example structures are given that would allow assessment of how they differ from standard edges or prevent error accumulation in automatic construction.
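
A minimal sketch of the edge-level precision and recall named in the first major comment, assuming graph edges can be compared as normalized (subject, relation, object) triples. The function names and sample triples are hypothetical and are not taken from the paper.

```python
from typing import Set, Tuple

Edge = Tuple[str, str, str]  # (subject, relation, object)


def normalize(edge: Edge) -> Edge:
    """Lowercase and trim each part so trivial surface differences do not count as errors."""
    subject, relation, obj = edge
    return (subject.strip().lower(), relation.strip().lower(), obj.strip().lower())


def edge_precision_recall(predicted: Set[Edge], gold: Set[Edge]) -> Tuple[float, float]:
    """Exact-match precision and recall of predicted edges against annotated edges."""
    pred = {normalize(e) for e in predicted}
    ref = {normalize(e) for e in gold}
    hits = len(pred & ref)
    precision = hits / len(pred) if pred else 0.0
    recall = hits / len(ref) if ref else 0.0
    return precision, recall


# Hypothetical edges extracted by the LLM versus a hand-annotated reference.
predicted = {("Alice", "owns", "Phone X"), ("Alice", "dislikes", "battery")}
gold = {
    ("alice", "owns", "phone x"),
    ("alice", "likes", "camera"),
    ("alice", "dislikes", "screen"),
    ("phone x", "has", "battery"),
}
precision, recall = edge_precision_recall(predicted, gold)
print(precision, recall)  # 0.5 0.25
```

Exact matching is the strictest choice; a real evaluation might instead need fuzzy entity matching, which this sketch does not attempt.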

minor comments (2)

[Abstract] The abstract asserts performance gains and robustness without accompanying numerical values or ablation tables; these should be added to the results section for verifiability.
Notation for the two hyper-edge types should be introduced consistently with a small illustrative figure or table to improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback on our manuscript. The comments highlight important areas for strengthening the presentation of our hybrid knowledge graph framework and its evaluation. We address each major comment below and have revised the manuscript accordingly to provide greater rigor and clarity.

Referee: [Evaluation] Evaluation section (including results on extended DiaASQ): the manuscript reports only end-task accuracy under different retrieval configurations but supplies no quantitative fidelity metrics for the LLM-driven graph construction and update process (e.g., edge precision/recall against ground truth, hyper-edge consistency, or structural drift after repeated updates). This is load-bearing for the central claim that the hybrid graph design (rather than prompting alone) confers robustness to temporal dependencies and contradictions.

Authors: We agree that direct quantitative validation of the LLM-driven graph construction would strengthen the central claims. In the revised manuscript we have added Subsection 5.4 'Fidelity of Graph Construction,' which reports edge-level precision and recall by comparing automatically constructed graphs against human-annotated ground truth on 200 sampled questions from the extended DiaASQ benchmark. We further include an automated consistency score for hyper-edges together with manual inspection results on a random sample of 50 hyper-edges. Finally, we present an analysis of structural drift by tracking the rate of contradictory or temporally inconsistent edge additions across 10 successive update cycles on the temporal reasoning tasks. These new metrics directly support the robustness claims while preserving the original end-task accuracy results.
revision: yes

Referee: [Methodology] Methodology / hybrid graph description: the two types of hyper-edges are introduced as enabling 'rich and dynamic semantic and temporal representations,' yet no formal definition, update rules, or example structures are given that would allow assessment of how they differ from standard edges or prevent error accumulation in automatic construction.

Authors: We accept that formal definitions and update rules improve reproducibility and allow readers to assess the distinction from standard edges. In the revised Section 3.2 we now provide explicit set-theoretic definitions: a semantic hyper-edge is a tuple (E, r) where E is a set of entities and r a shared semantic relation; a temporal hyper-edge is a tuple (E, t_start, t_end, r) that additionally scopes the relation to a time interval. We describe the LLM-based update rules, which include a consistency verification step that rejects or merges edges violating existing temporal or semantic constraints. Concrete example structures drawn from the extended DiaASQ benchmark are included to illustrate how these mechanisms differ from standard edges and reduce error accumulation by enforcing explicit scoping and grouping.
revision: yes
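
The tuple definitions in this simulated response can be sketched as plain data classes. This is an illustration under stated assumptions, not the paper's implementation: integer timestamps, the `CONTRADICTS` table, and the reject-or-merge rule in `add_temporal` are all hypothetical stand-ins for the LLM-based consistency step the response describes.

```python
from dataclasses import dataclass
from typing import FrozenSet, List


@dataclass(frozen=True)
class SemanticHyperEdge:
    entities: FrozenSet[str]  # E: the set of entities grouped together
    relation: str             # r: the shared semantic relation


@dataclass(frozen=True)
class TemporalHyperEdge:
    entities: FrozenSet[str]  # E
    t_start: int              # assumed integer timestamp
    t_end: int                # assumed integer timestamp, inclusive
    relation: str             # r, valid only within [t_start, t_end]


# Hypothetical table of mutually exclusive relations.
CONTRADICTS = {("likes", "dislikes"), ("dislikes", "likes")}


def overlaps(a: TemporalHyperEdge, b: TemporalHyperEdge) -> bool:
    """True when the two closed intervals share at least one time point."""
    return a.t_start <= b.t_end and b.t_start <= a.t_end


def add_temporal(store: List[TemporalHyperEdge], new: TemporalHyperEdge) -> str:
    """Reject overlapping contradictions, merge overlapping duplicates, else append."""
    for i, old in enumerate(store):
        if old.entities != new.entities or not overlaps(old, new):
            continue
        if (old.relation, new.relation) in CONTRADICTS:
            return "rejected"
        if old.relation == new.relation:
            store[i] = TemporalHyperEdge(
                old.entities,
                min(old.t_start, new.t_start),
                max(old.t_end, new.t_end),
                old.relation,
            )
            return "merged"
    store.append(new)
    return "added"


pair = frozenset({"alice", "phone_x"})
store: List[TemporalHyperEdge] = []
results = [
    add_temporal(store, TemporalHyperEdge(pair, 1, 5, "likes")),
    add_temporal(store, TemporalHyperEdge(pair, 4, 8, "likes")),      # extends [1, 5] to [1, 8]
    add_temporal(store, TemporalHyperEdge(pair, 6, 7, "dislikes")),   # conflicts inside [1, 8]
    add_temporal(store, TemporalHyperEdge(pair, 9, 12, "dislikes")),  # later interval, no conflict
]
print(results)  # ['added', 'merged', 'rejected', 'added']
```

The last call shows why interval scoping matters: a changed opinion is stored as a later fact rather than treated as a contradiction.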

Circularity Check

0 steps flagged

Minor self-citation to AriGraph architecture that is not load-bearing for central claims

The paper proposes a hybrid knowledge graph framework with standard edges and two hyper-edge types, constructed automatically by an LLM, and evaluates retrieval methods empirically on external benchmarks (TriviaQA, HotpotQA, DiaASQ plus temporal extension). It references building upon the AriGraph architecture but introduces novel elements and reports task-dependent performance results. No equations, fitted parameters renamed as predictions, self-definitional constructs, or uniqueness theorems appear. The reference to prior architecture is a normal citation (potentially self-citation) but does not justify or reduce the novel design or empirical outcomes by construction. The work is self-contained against external benchmarks, consistent with a low circularity score.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on the untested premise that an LLM can reliably maintain a dynamic knowledge graph over long interactions; no free parameters or invented entities are quantified in the abstract.

axioms (1)

domain assumption: LLMs can construct and maintain accurate knowledge graphs from conversational history without external supervision.
Implicit in the statement that the graph is 'constructed and updated automatically by the LLM'.

invented entities (1)

hybrid graph with two types of hyper-edges (no independent evidence)
purpose: To capture semantic and temporal relations beyond standard edges
Introduced as a novel design element in the framework description

pith-pipeline@v0.9.0 · 5759 in / 1319 out tokens · 47299 ms · 2026-05-19T08:20:29.588048+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: "novel hybrid graph design that supports both standard edges and two types of hyper-edges, enabling rich and dynamic semantic and temporal representations... retrieval mechanisms, including A*, WaterCircles traversal, beam search, and hybrid methods"

IndisputableMonolith/Foundation/DimensionForcing.lean · alexander_duality_circle_linking · unclear
Paper passage: "extend the DiaASQ benchmark with temporal annotations and internally contradictory statements"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.