DMF: A Deterministic Memory Framework for Conversational AI Agents

Enrico Zimuel; Matteo Stabile

arxiv: 2606.03463 · v1 · pith:SPUXM4DPnew · submitted 2026-06-02 · 💻 cs.AI · cs.CL

Pith reviewed 2026-06-28 10:13 UTC · model grok-4.3

keywords deterministic memory · conversational AI · survival score · token efficiency · memory pruning · classical NLP · AI agents · semantic relevance

The pith

DMF replaces LLM-based memory summarization with a deterministic pipeline of classical signals and a survival score to match accuracy at 5x-242x lower token cost.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces the Deterministic Memory Framework as an alternative to current conversational AI memory systems that depend on large language models for summarization at write time. DMF computes a Survival Score for each interaction from fixed content signals, conversational cues, and provenance data, then applies an interaction-count decay to decide what to keep or prune. The central goal is to remove all generative steps from the memory loop so that token costs for context preparation fall to zero while long-horizon coherence is preserved. A reader would care because existing approaches incur escalating, non-deterministic costs that limit how far agents can converse before memory becomes impractical.

Core claim

DMF assigns each conversational interaction a Survival Score Ω computed from deterministic content signals, conversational cues, and structured provenance combined through a logistic projection; an interaction-count decay law Ω_eff(Δn) then governs how relevance evolves with newer turns, enabling a fully deterministic recall and pruning pipeline that eliminates LLM calls from memory management while matching accuracy on LoCoMo and LongMemEval benchmarks.

What carries the argument

The Survival Score Ω, formed by logistic projection of deterministic signals and updated by interaction-count decay Ω_eff(Δn).
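
A minimal sketch of how such a score could be computed, under stated assumptions. The text above gives neither the coefficient values nor the closed form of the decay law, so the signal names, the weights, the bias, and the exponential decay below are hypothetical placeholders. They illustrate only the shape "logistic projection, then decay by interaction count", not the authors' formulation.

```python
import math

# Hypothetical weights and bias; the source does not publish coefficient values.
WEIGHTS = {"content": 1.5, "cues": 1.0, "provenance": 0.5}
BIAS = -1.0
DECAY_RATE = 0.05  # assumed per-interaction decay constant


def survival_score(signals: dict[str, float]) -> float:
    """Logistic projection of a weighted sum of deterministic signals."""
    z = BIAS + sum(WEIGHTS[name] * signals[name] for name in WEIGHTS)
    return 1.0 / (1.0 + math.exp(-z))


def effective_score(omega: float, delta_n: int) -> float:
    """Decay the score by the count of newer interactions (assumed exponential)."""
    return omega * math.exp(-DECAY_RATE * delta_n)


if __name__ == "__main__":
    omega = survival_score({"content": 0.8, "cues": 0.4, "provenance": 1.0})
    for delta_n in (0, 10, 50):
        print(delta_n, round(effective_score(omega, delta_n), 4))
```

Because `delta_n` is an interaction count rather than a timestamp, replaying the same conversation yields the same scores regardless of when it is replayed.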

If this is right

Memory context preparation requires zero LLM tokens.
Overall conversation token usage falls by factors between 5 and 242 compared with Mem0.
Pruning decisions become fully deterministic and traceable to explicit signals rather than opaque model outputs.
Long interaction horizons remain feasible without escalating generative costs.
The memory layer can run entirely on CPU without model inference.
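
To make "deterministic and traceable" concrete, here is a small sketch of a threshold-based pruning pass that records a reason for every dropped memory. The `Memory` record, the single-threshold rule, and the trace format are assumptions for illustration; the paper's actual pruning decision procedure is not reproduced on this page.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Memory:
    turn: int         # index of the interaction in the conversation
    text: str
    omega_eff: float  # decayed score, assumed to be computed elsewhere


def prune(memories: list[Memory], threshold: float) -> tuple[list[Memory], list[str]]:
    """Keep memories at or above the threshold; log one reason per dropped memory."""
    kept: list[Memory] = []
    trace: list[str] = []
    # Sorting by turn makes the result independent of the input order.
    for m in sorted(memories, key=lambda m: m.turn):
        if m.omega_eff >= threshold:
            kept.append(m)
        else:
            trace.append(f"turn {m.turn}: pruned, omega_eff={m.omega_eff:.3f} < {threshold}")
    return kept, trace
```

The same inputs always produce the same kept set and the same trace, which is the property an LLM-based summarizer cannot guarantee.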

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same signal-plus-decay structure could be tested on non-conversational agent memory tasks such as tool-use histories or planning traces.
If the logistic projection proves stable across domains, hybrid systems might use DMF for routine retention and reserve LLMs only for rare edge-case summarization.
The interaction-count decay (rather than wall-clock time) suggests the framework could transfer directly to batch or offline agent logs where timing is irregular.

Load-bearing premise

That deterministic content signals, conversational cues, and structured provenance can be combined through a logistic projection into a Survival Score that reliably tracks semantic relevance without any generative model.

What would settle it

Running the same benchmark conversations with DMF pruning versus LLM summarization and measuring whether accuracy on downstream recall or coherence tasks drops below the LLM baseline by a statistically detectable margin.
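
One way to operationalize "statistically detectable" is a paired test on per-question outcomes. The sketch below uses an exact two-sided McNemar test, which is a choice made here for illustration and not a procedure named in the paper; the function and argument names are hypothetical.

```python
from math import comb


def mcnemar_exact(dmf_correct: list[bool], llm_correct: list[bool]) -> float:
    """Two-sided exact McNemar p-value on paired per-question outcomes."""
    pairs = list(zip(dmf_correct, llm_correct))
    b = sum(d and not l for d, l in pairs)  # DMF right, baseline wrong
    c = sum(l and not d for d, l in pairs)  # baseline right, DMF wrong
    n = b + c  # only discordant pairs carry information
    if n == 0:
        return 1.0
    # Binomial(n, 0.5) tail probability up to the smaller discordant count.
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)
```

A small p-value would indicate that the two systems disagree asymmetrically on the same questions, which is a sharper signal than comparing two aggregate accuracy figures.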

Figures

Figures reproduced from arXiv: 2606.03463 by Enrico Zimuel, Matteo Stabile.

**Figure 1.** DMF runtime pipeline.

Text captured with the figure (Section 4, NLP Feature Extraction): For each interaction text t, the NLP engine extracts three scalar content signals and a structured conversational-signal envelope, with no LLM involvement. The scalar signals drive the content component of the Survival Score; the structured envelope is consumed by scoring, pruning, card projection, and retrieval. Section 4.1 (Information Density): Information density ID …

**Figure 2.** DMF vs. Mem0 metrics using LoCoMo.

Text captured with the figure: On the temporal reasoning group of LoCoMo, DMF performs 4× better than Mem0, which is known to face challenges with temporal reasoning (footnote 3 in the paper). DMF achieves better performance in this setting because it preserves absolute timestamps and conversational order as part of both the memory representation and the final prompt. In contrast, Mem0 tends to expose synthesized memories, in wh…

**Figure 3.** DMF vs. Mem0 metrics using LongMemEval-10.

Original abstract

Conversational AI agents require memory systems that are both scalable and semantically coherent across long interaction horizons. Existing approaches rely predominantly on large language model (LLM)-based summarisation at write time, which introduces non-determinism, escalating token costs, and opacity in pruning decisions. We present the Deterministic Memory Framework (DMF), a CPU-first approach that replaces generative memory compression with a fully deterministic pipeline grounded in classical NLP analysis, vector geometry, and mathematical scoring. DMF assigns each conversational interaction a Survival Score $\Omega$ computed from deterministic content signals, conversational cues, and structured provenance, combined through a logistic projection. An interaction-count decay law, denoted as $\Omega_{\mathrm{eff}}(\Delta n)$, governs how relevance evolves as new turns arrive, where $\Delta n$ is the number of newer interactions rather than wall-clock time, preserving full determinism. We present the mathematical formulation of DMF, its structured recall pipeline, the pruning decision procedure, and the evaluation protocol. Experiments are conducted on a purpose-built benchmark using the LoCoMo and LongMemEval datasets. We compare DMF against Mem0, a popular memory layer for AI agents. DMF achieves comparable accuracy while using zero tokens to prepare the memory context and 5x to 242x fewer tokens over the entire conversation. These results show that it is possible to eliminate LLM calls from the memory-management loop, reducing token costs to nearly zero and enabling deterministic memory systems for conversational AI agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DMF offers a deterministic memory pipeline using classical signals and a Survival Score, but its semantic tracking rests only on end-to-end benchmark parity.

DMF is a deterministic memory framework that replaces LLM summarization with classical NLP signals, vector geometry, and a logistic Survival Score Ω, then applies an interaction-count decay Ω_eff(Δn). The main result is comparable accuracy to Mem0 on LoCoMo and LongMemEval while using zero tokens for memory preparation and 5x to 242x fewer tokens overall.

The concrete new element is the named combination: content signals plus conversational cues plus provenance projected logistically into the score, with decay driven by newer interaction count rather than wall time. The paper also spells out the recall pipeline and pruning rule. This setup keeps everything CPU-only and fully deterministic, which directly tackles token cost and predictability for long-horizon agents.

The work is practical. Anyone shipping agent memory layers cares about these operating costs, and the zero-token prep claim is a clear operational win if it holds.

The soft spot is validation of the score itself. We see only end-to-end accuracy parity; there are no ablations on signal contributions, no correlation against external relevance labels, and no failure-case breakdowns. The logistic coefficients are free parameters, so if they were tuned on the same benchmarks the circularity concern stands and the score may be fitting surface cues rather than semantics. The abstract states the formulation exists but supplies no equations or parameter values, leaving the robustness hard to judge.

This paper is for engineers who need lower-cost, deterministic memory alternatives rather than theorists chasing new LLM capabilities. A reader testing memory layers would get usable ideas from the pipeline even if they have to add their own checks.

It deserves peer review. The deployment problem is real and the approach is straightforward to test further.

Referee Report

3 major / 1 minor

Summary. The manuscript introduces the Deterministic Memory Framework (DMF) as a CPU-first, fully deterministic alternative to LLM-based memory management for conversational agents. DMF computes a Survival Score Ω for each interaction from deterministic content signals, conversational cues, and structured provenance via logistic projection, applies an interaction-count decay law Ω_eff(Δn), and uses the resulting scores for structured recall and pruning. On LoCoMo and LongMemEval benchmarks, DMF is reported to match the accuracy of Mem0 while incurring zero tokens for memory-context preparation and 5×–242× fewer tokens overall.

Significance. If the Survival Score can be shown to track semantic relevance independently of the evaluation benchmarks, DMF would offer a concrete route to scalable, low-cost, and fully reproducible memory systems that eliminate non-determinism and token overhead from the memory loop. The explicit mathematical formulation and emphasis on a reproducible evaluation protocol are strengths that distinguish the work from purely empirical LLM-memory papers.

major comments (3)

[Mathematical formulation] Mathematical formulation section: the logistic projection that produces Ω is described at a high level, yet the coefficients are free parameters whose values and derivation (first-principles versus fit to benchmark data) are not supplied; without this, the claim that Ω tracks semantic relevance rather than surface cues cannot be evaluated.
[Evaluation protocol] Evaluation protocol and results sections: the accuracy-parity claim rests on end-to-end benchmark numbers alone; no ablation of the individual signal components, no correlation of Ω with external human relevance labels, and no failure-case analysis are reported, leaving open whether the deterministic pipeline selects memories for semantic reasons or for cue statistics.
[Results] Token-usage comparison: the headline 5×–242× reduction is load-bearing for the practical contribution, yet the manuscript provides neither per-dataset breakdowns nor explicit accounting of how Mem0’s token counts (including any LLM calls for summarization) were measured across full conversations.
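
The second major comment asks for a correlation between Ω and external relevance labels. A minimal sketch of that check follows, assuming per-interaction human relevance labels were available (the paper reports none). Spearman rank correlation is used here as one reasonable choice, not one the referee prescribes, and the helper names are hypothetical.

```python
def average_ranks(values: list[float]) -> list[float]:
    """Ranks starting at 1, with tied values sharing the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def spearman(xs: list[float], ys: list[float]) -> float:
    """Pearson correlation of the rank vectors; undefined if either input is constant."""
    rx, ry = average_ranks(xs), average_ranks(ys)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    return cov / (var_x * var_y) ** 0.5
```

Running `spearman(omega_scores, human_labels)` once per signal subset would also give a simple form of the requested ablation.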

minor comments (1)

[Abstract] The abstract states that equations and an evaluation protocol are presented, but the provided text supplies neither concrete coefficient values nor error bars; adding these would improve reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thoughtful review and constructive comments. We address each major comment in turn below.

Referee: [Mathematical formulation] Mathematical formulation section: the logistic projection that produces Ω is described at a high level, yet the coefficients are free parameters whose values and derivation (first-principles versus fit to benchmark data) are not supplied; without this, the claim that Ω tracks semantic relevance rather than surface cues cannot be evaluated.

Authors: We agree that the specific coefficients and their derivation were not sufficiently detailed in the original manuscript. In the revised version, we will include the exact values of the coefficients used in the logistic projection and explain their derivation from first-principles analysis of the content signals and conversational cues, without fitting to benchmark data. This will allow readers to evaluate the independence from surface cues.

Revision: yes

Referee: [Evaluation protocol] Evaluation protocol and results sections: the accuracy-parity claim rests on end-to-end benchmark numbers alone; no ablation of the individual signal components, no correlation of Ω with external human relevance labels, and no failure-case analysis are reported, leaving open whether the deterministic pipeline selects memories for semantic reasons or for cue statistics.

Authors: The manuscript emphasizes end-to-end performance on established benchmarks to demonstrate practical utility. However, we acknowledge the value of additional analyses. In revision, we will add ablations on the contribution of individual signal components and include a failure-case analysis. Regarding correlation with human relevance labels, we did not collect such labels in this study as the focus was on deterministic reproducibility; we will discuss this limitation and suggest it as future work.

Revision: partial

Referee: [Results] Token-usage comparison: the headline 5×–242× reduction is load-bearing for the practical contribution, yet the manuscript provides neither per-dataset breakdowns nor explicit accounting of how Mem0’s token counts (including any LLM calls for summarization) were measured across full conversations.

Authors: We will provide per-dataset token usage breakdowns in the revised manuscript. The Mem0 token counts were measured by counting all tokens used in LLM calls for memory operations during the full conversation simulations on the benchmarks, including summarization steps. We will add an explicit section detailing the measurement protocol to ensure reproducibility.

Revision: yes
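
The accounting described in the last response can be sketched as a per-category token ledger plus a ratio. The call-log shape (category, prompt tokens, completion tokens) and the category names are assumptions for illustration; no measured values from the paper are used.

```python
from collections import Counter


def total_tokens(calls: list[tuple[str, int, int]]) -> Counter:
    """Sum prompt and completion tokens per call category, e.g. 'memory' or 'answer'."""
    totals: Counter = Counter()
    for category, prompt_tokens, completion_tokens in calls:
        totals[category] += prompt_tokens + completion_tokens
    return totals


def reduction_factor(baseline_total: int, candidate_total: int) -> float:
    """How many times fewer tokens the candidate used than the baseline."""
    if candidate_total == 0:
        raise ValueError("candidate used zero tokens; the ratio is undefined")
    return baseline_total / candidate_total
```

Keeping memory-operation tokens in their own category is what lets a reader separate the "zero tokens for memory preparation" claim from the whole-conversation reduction factor, since the latter is only defined when the candidate's total is nonzero.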

Circularity Check

0 steps flagged

No significant circularity; derivation remains self-contained

The DMF Survival Score Ω is defined via explicit logistic projection of deterministic signals (content, cues, provenance) and an interaction-count decay Ω_eff(Δn). No equation or section shows these weights or the projection itself being fitted to the LoCoMo/LongMemEval benchmarks used for final accuracy reporting; the formulation is presented as a fixed mathematical construction independent of the evaluation data. End-to-end accuracy comparisons therefore test an externally specified scoring rule rather than a quantity defined by the test outcomes. No self-citation chain, self-definitional loop, or renaming of fitted results appears in the provided derivation steps. The token-reduction claim follows directly from the absence of LLM calls in the memory pipeline, which is independent of the score's semantic fidelity.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The framework rests on the unproven premise that classical signals plus logistic projection suffice for relevance; no free parameters are enumerated in the abstract but the logistic step implies at least scale and bias terms.

free parameters (1)

logistic projection coefficients
Weights that combine content signals, cues, and provenance into Ω are not derived from first principles and are therefore treated as fitted.

axioms (1)

Domain assumption: Deterministic content signals, conversational cues, and provenance can be linearly combined and passed through a logistic function to produce a relevance score that tracks human judgment of importance.
This is the central modeling choice stated in the abstract for computing Ω.

Reference graph

Works this paper leans on

27 extracted references · 12 canonical work pages · 10 internal anchors

[1] Adyasha Maharana et al. “Evaluating Very Long-Term Conversational Memory of LLM Agents”. In: arXiv preprint arXiv:2402.17753 (2024). URL: https://arxiv.org/abs/2402.17753

[2] Di Wu et al. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. 2025. arXiv: 2410.10813 [cs.CL]. URL: https://arxiv.org/abs/2410.10813

[3] Prateek Chhikara et al. “Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory”. In: arXiv preprint arXiv:2504.19413 (2025). URL: https://arxiv.org/abs/2504.19413

[4] Matthew Honnibal et al. spaCy: Industrial-strength Natural Language Processing in Python. Zenodo. 2020. DOI: 10.5281/zenodo.1212303. URL: https://spacy.io

[5] Clayton J. Hutto and Eric E. Gilbert. “VADER: A Parsimonious Rule-based Model for Sentiment Analysis of Social Media Text”. In: Proceedings of the 8th International Conference on Weblogs and Social Media (ICWSM). AAAI Press, 2014, pp. 216–225. URL: https://ojs.aaai.org/index.php/ICWSM/article/view/14550

[6] Charles Packer et al. “MemGPT: Towards LLMs as Operating Systems”. In: arXiv preprint arXiv:2310.08560 (2023). URL: https://arxiv.org/abs/2310.08560

[7] Wujiang Li et al. “A-MEM: Agentic Memory for LLM Agents”. In: arXiv preprint arXiv:2502.12110 (2025). URL: https://arxiv.org/abs/2502.12110

[8] Kuang-Huei Lee et al. “ReadAgent: A System for Getting Better LLM Responses to Long Input Documents Using Memory and Retrieval”. In: arXiv preprint arXiv:2402.09727 (2024). URL: https://arxiv.org/abs/2402.09727

[9] Wanjun Zhong et al. “MemoryBank: Enhancing Large Language Models with Long-Term Memory”. In: arXiv preprint arXiv:2305.10250 (2023). URL: https://arxiv.org/abs/2305.10250

[10] Hermann Ebbinghaus. Über das Gedächtnis: Untersuchungen zur experimentellen Psychologie. English translation: Memory: A Contribution to Experimental Psychology, Teachers College, Columbia University, 1913. Leipzig: Duncker & Humblot, 1885

[11] Patrick Lewis et al. “Retrieval-Augmented Generation for Knowledge-Intensive NLP Tasks”. In: Advances in Neural Information Processing Systems (NeurIPS). Vol. 33. 2020, pp. 9459–9474. URL: https://arxiv.org/abs/2005.11401

[12] Yunfan Gao et al. “Retrieval-Augmented Generation for Large Language Models: A Survey”. In: arXiv preprint arXiv:2312.10997 (2023). URL: https://arxiv.org/abs/2312.10997

[13] LangChain AI. LangMem: Long-Term Memory for LLM Agents. https://github.com/langchain-ai/langmem. 2024

[14] Harrison Chase. LangChain: Building Applications with LLMs through Composability. https://github.com/langchain-ai/langchain. 2023

[15] John R. Anderson and Christian Lebiere. “The Atomic Components of Thought”. In: Lawrence Erlbaum Associates (1998). ACT-R cognitive architecture; see also Anderson, J.R. (1983). The Architecture of Cognition. Harvard University Press

[16] Qdrant Team. FastEmbed: Fast, Accurate, and Lightweight Python Library for Embeddings. https://github.com/qdrant/fastembed. Python library wrapping ONNX-based embedding models for CPU-efficient inference. 2023

[17] Chroma Team. ChromaDB: The Open-Source Embedding Database. https://www.trychroma.com. Open-source vector database used as the default LTM backend in DMF. 2023

[18] Shitao Xiao et al. “C-Pack: Packaged Resources To Advance General Chinese Embedding”. In: arXiv preprint arXiv:2309.07597 (2023). Source of the BAAI/bge-small-en-v1.5 embedding model family used by DMF. URL: https://arxiv.org/abs/2309.07597

[19] Aaditya Singh et al. OpenAI GPT-5 System Card. 2026. arXiv: 2601.03267 [cs.CL]. URL: https://arxiv.org/abs/2601.03267

Appendix A text captured after reference [19]: This appendix reports the prompt templates used in the benchmark pipeline to evaluate DMF against Mem0. The templates are shared across frameworks; only the memory con...
Judge-prompt rules from the paper's appendix, captured by the extractor as entries [20] to [27]:

**PARTIAL CREDIT**: If the generated answer includes AT LEAST ONE correct item from the gold answer’s list, mark CORRECT. Getting 1 out of 2, 2 out of 4, etc. is always acceptable. Only mark WRONG if NONE of the gold answer items appear.

**PARAPHRASES COUNT**: Same concept in different words is CORRECT. Judge semantic meaning, not exact wording.

**EXTRA DETAIL IS FINE**: A longer answer that includes the gold answer’s key facts plus additional information is CORRECT. Never penalize for being more detailed or specific.

**DATE TOLERANCE**: Dates within 14 days of each other are CORRECT. Durations within 50% are CORRECT. Relative dates that point to the same time window are CORRECT.

**ABSTENTION MATCHING**: If the gold answer is an abstention or indicates the information is unavailable, any semantically equivalent refusal to answer is CORRECT.

**SEMANTIC OVERLAP**: Judge whether the generated answer addresses the same topic and captures the core idea of the gold answer. Different wording, phrasing, or level of detail should not result in WRONG if the underlying concept matches.

**SAME REFERENT**: If the generated answer identifies the same named entity, person, character, place, or concept as the gold answer, mark CORRECT even if it gives a different description or extra detail.

**FOCUS ON KNOWLEDGE, NOT WORDING**: The goal is to assess whether the system recalled the right fact. Minor differences in specificity, phrasing, or scope should not result in WRONG. Only mark WRONG when the generated answer demonstrates a genuinely different or incorrect understanding. ## ONLY mark WRONG if: - The generated answer contains ZERO correct ...
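
The DATE TOLERANCE rule is the only one above with numeric thresholds, so it can be made concrete in code. In the paper these rules are applied by a judge prompt, not by a program; the sketch below only illustrates how wide the tolerances are. Reading "within 50%" as relative error against the gold duration is an assumption, and both function names are hypothetical.

```python
from datetime import date


def dates_match(gold: date, answer: date, tolerance_days: int = 14) -> bool:
    """True when the two dates are at most `tolerance_days` apart."""
    return abs((gold - answer).days) <= tolerance_days


def durations_match(gold: float, answer: float, tolerance: float = 0.5) -> bool:
    """True when the relative error against the gold duration is within `tolerance`."""
    if gold == 0:
        return answer == 0
    return abs(answer - gold) / abs(gold) <= tolerance
```

Under this reading, an answer of 15 days against a gold duration of 10 days still counts as correct, which is worth keeping in mind when comparing the reported accuracies with results scored under stricter rubrics.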
