MemDelta: Controlled Baselines and Hidden Confounds in Agent Memory Evaluation

Kuan Wang

arxiv: 2606.29914 · v1 · pith:ECRKBHSRnew · submitted 2026-06-29 · 💻 cs.CL · cs.LG

Pith reviewed 2026-06-30 06:43 UTC · model grok-4.3

classification 💻 cs.CL cs.LG

keywords agent memory evaluation · RAG baselines · embedding confounds · controlled benchmarks · LLM memory systems · LongMemEval · retrieval augmentation

The pith

Controlled tests show that embedding model choice and base LLM often determine memory rankings more than the memory method.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MemDelta to evaluate memory systems by changing only one variable at a time on the LongMemEval-S benchmark. It finds that verbatim RAG roughly matches full-context performance for GPT-4o-mini while the ranking flips between other models (Gemini does better with full context, Sonnet with RAG), that swapping the embedding model alone shifts accuracy by 6.2 percentage points, and that agent self-memory underperforms basic retrieval. These patterns indicate that many reported gains for complex memory architectures arise from differences in embeddings, base models, or refusal rates rather than from the memory design itself. The authors therefore advise fixing embeddings across comparisons, stratifying results by model family, and reporting write-path costs.

Core claim

Applying the controlled MemDelta protocol on LongMemEval-S across three model families shows that verbatim RAG matches full-context GPT-4o-mini accuracy (47.2% vs 49.8%), that embedding swaps alone move accuracy by +6.2pp, that self-memory reaches only 42% against 47% for basic retrieval, and that on two of six question types a specialized system matches cloud RAG at fifty times the cost.

What carries the argument

MemDelta, the evaluation protocol that holds all but one pipeline component fixed while measuring accuracy on 500 questions across 50+ sessions.
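
The one-variable-at-a-time idea can be sketched as a small configuration generator. This is a minimal illustration, not the paper's code: the `PipelineConfig` fields, their values, and the `one_factor_variants` helper are all hypothetical names chosen for the example.

```python
from dataclasses import dataclass, replace
from typing import Dict, Iterator, List, Tuple


@dataclass(frozen=True)
class PipelineConfig:
    # Field names and values are illustrative placeholders, not the paper's settings.
    llm: str = "llm-a"
    embedder: str = "embedder-a"
    memory_strategy: str = "verbatim-rag"
    top_k: int = 5


def one_factor_variants(
    baseline: PipelineConfig, alternatives: Dict[str, List[object]]
) -> Iterator[Tuple[str, PipelineConfig]]:
    """Yield (changed_field, config) pairs that differ from the baseline in one field."""
    for name, values in alternatives.items():
        for value in values:
            if value != getattr(baseline, name):  # skip the baseline's own value
                yield name, replace(baseline, **{name: value})


baseline = PipelineConfig()
variants = list(one_factor_variants(baseline, {
    "embedder": ["embedder-a", "embedder-b"],
    "llm": ["llm-b", "llm-c"],
    "memory_strategy": ["full-context", "self-memory"],
}))
for changed, config in variants:
    print(changed, config)
```

Each variant is compared only against the baseline, so any accuracy difference can be attributed to the single field that changed. Crossing several fields at once would need a factorial design instead.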

If this is right

Verbatim RAG matches full-context performance for GPT-4o-mini but Gemini gains 14pp from full context while Sonnet gains 31pp from RAG.
Swapping only the embedding model in an otherwise identical pipeline changes accuracy by +6.2pp at n=500.
Agent self-memory reaches 42% while basic retrieval reaches 47%.
On two of six question types, a specialized memory system matches cloud RAG at 50x the cost.
Memory evaluations should fix embedding models, stratify by model family, and report write-path costs.
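
One way to act on the cost-reporting recommendation is to record write-path and read-path spend next to accuracy for every strategy. The sketch below is an assumption about how such a record might look: `PathCost`, `ReportRow`, the unit prices, and all numbers are hypothetical and are not measurements from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PathCost:
    """API calls spent on one path of a memory pipeline (hypothetical accounting)."""
    llm_calls: int
    embed_calls: int

    def units(self, llm_unit: float = 1.0, embed_unit: float = 0.01) -> float:
        # Unit prices are placeholders; substitute real per-call prices.
        return self.llm_calls * llm_unit + self.embed_calls * embed_unit


@dataclass(frozen=True)
class ReportRow:
    strategy: str
    accuracy: float   # fraction of questions judged correct
    write: PathCost   # ingestion: extraction and indexing
    read: PathCost    # answering: retrieval and generation

    def total_units(self) -> float:
        return self.write.units() + self.read.units()


def cost_ratio(a: ReportRow, b: ReportRow) -> float:
    """How many times more expensive row a is than row b."""
    return a.total_units() / b.total_units()


# Toy numbers, not measurements from the paper.
extraction = ReportRow("extraction-memory", 0.70, PathCost(50, 50), PathCost(1, 1))
plain_rag = ReportRow("plain-rag", 0.70, PathCost(0, 50), PathCost(1, 1))
print(f"{cost_ratio(extraction, plain_rag):.1f}x")
```

Keeping the write path separate matters because extraction-based systems pay most of their cost during ingestion, which a read-only cost figure would hide.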

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Standardizing the embedding model across all compared systems would isolate whether architectural differences actually drive gains.
The model-family reversals imply that refusal rates in full-context settings are a hidden variable in many existing comparisons.
Narrow parity on only two of six question types suggests that claims of general superiority require explicit stratification by query category.
Requiring cost reporting alongside accuracy would make it harder to attribute small gains to expensive memory architectures.
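
The stratification and refusal points above can be made concrete with a summary that reports accuracy, refusal rate, and accuracy on answered questions per group. This is a sketch under assumptions: the `Outcome` record and `stratified_summary` function are hypothetical, and the data are toy values rather than results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Outcome:
    model_family: str
    question_type: str
    correct: bool
    refused: bool


def stratified_summary(outcomes: List[Outcome], key: str) -> Dict[str, dict]:
    """Group outcomes by the given attribute and summarize each group."""
    groups = defaultdict(list)
    for o in outcomes:
        groups[getattr(o, key)].append(o)
    summary = {}
    for name, rows in groups.items():
        n = len(rows)
        answered = [r for r in rows if not r.refused]
        summary[name] = {
            "n": n,
            "accuracy": sum(r.correct for r in rows) / n,
            "refusal_rate": (n - len(answered)) / n,
            # None when every query in the group was refused.
            "accuracy_when_answered": (
                sum(r.correct for r in answered) / len(answered) if answered else None
            ),
        }
    return summary


# Toy data: family-a refuses half its queries, family-b refuses none.
outcomes = [
    Outcome("family-a", "type-1", True, False),
    Outcome("family-a", "type-1", True, False),
    Outcome("family-a", "type-2", False, True),
    Outcome("family-a", "type-2", False, True),
    Outcome("family-b", "type-1", True, False),
    Outcome("family-b", "type-1", True, False),
    Outcome("family-b", "type-2", True, False),
    Outcome("family-b", "type-2", False, False),
]
summary = stratified_summary(outcomes, "model_family")
for family, stats in summary.items():
    print(family, stats)
```

Passing `"question_type"` as the key gives the per-category view instead. Reporting `accuracy_when_answered` beside raw accuracy separates a model that answers badly from one that declines to answer.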

Load-bearing premise

The LongMemEval-S benchmark and its question types are representative enough that single-variable isolations reveal general truths about memory system value.

What would settle it

A replication study on an independent 500-question benchmark in which memory systems still outperform fixed-embedding RAG baselines by large margins across multiple model families would falsify the central claim.

Figures

Figures reproduced from arXiv: 2606.29914 by Kuan Wang.

**Figure 1.** A measured case study: one hidden variable flips the conclusion. The same LongMemEval-S question is processed through three pipelines. Row 1: MiniLM-based RAG retrieves an irrelevant passage and fails. Row 2: Cloud-embedding RAG (identical code, different embedder) retrieves the correct passage and succeeds. Row 3: Mem0’s extraction pipeline succeeds but costs 1,000+ LLM calls. With MiniLM, Mem0 appears to…

**Figure 2.** The MemDelta protocol. Strategy cards connected by labeled comparison edges, each isolating one confounded variable. Top bar: components held fixed across all comparisons. Bottom: the five-step per-question procedure.

Original abstract

Agent memory systems are increasingly evaluated against RAG and full-context baselines, but reported gains often mix changes in the memory method with changes in the language model, embedding model, or retrieval pipeline, making it unclear what is actually being measured. We present MemDelta, a controlled evaluation protocol that varies one component at a time on LongMemEval-S (500 questions, 50+ sessions, three model families). Four findings emerge: (1) verbatim RAG matches full-context GPT-4o-mini (47.2% vs. 49.8%, p = 0.34), but the ranking reverses across models: Gemini gains +14pp from full context, while Sonnet gains +31pp from RAG, partly because it refuses 63% of full-context queries; (2) swapping only the embedding model in an identical pipeline shifts accuracy by +6.2pp at n = 500 (p = 0.004), and Mem0 beats MiniLM-RAG by +11pp but loses to cloud-RAG by 1.2pp, so one variable flips the conclusion; (3) agent self-memory (42%) underperforms basic retrieval (47%); (4) on 2 of 6 question types (n = 88), Mem0 matches cloud RAG (72.7% vs. 73.9%, p = 1.0) at 50x the cost, suggesting narrow rather than general gains. We recommend memory evaluations fix embedding models across comparisons, stratify by model family, and report write-path cost before attributing gains to architecture.
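
The abstract reports p-values but does not name the test behind them. When two pipelines answer the same questions, one common choice for paired binary outcomes is the exact McNemar test, sketched below as an assumption rather than as the authors' method. The function names and the example inputs are hypothetical.

```python
from math import comb
from typing import Sequence, Tuple


def discordant_counts(
    a_correct: Sequence[bool], b_correct: Sequence[bool]
) -> Tuple[int, int]:
    """Count questions that only A got right and questions that only B got right."""
    only_a = sum(x and not y for x, y in zip(a_correct, b_correct))
    only_b = sum(y and not x for x, y in zip(a_correct, b_correct))
    return only_a, only_b


def mcnemar_exact(only_a: int, only_b: int) -> float:
    """Two-sided exact McNemar p-value from the two discordant counts."""
    n = only_a + only_b
    if n == 0:
        return 1.0  # the systems never disagree
    k = min(only_a, only_b)
    # Binomial(n, 0.5) probability of a split at least this lopsided, doubled.
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy example: per-question correctness for two pipelines on the same questions.
a = [True, True, False, False, True]
b = [True, False, True, False, True]
print(mcnemar_exact(*discordant_counts(a, b)))
```

Only the questions on which the two pipelines disagree carry information here, which is why a paired test can detect a shift that two separate accuracy figures would leave ambiguous.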

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

MemDelta shows that embedding swaps and model choice can flip memory evaluation rankings on LongMemEval-S, with effects ranging from 6.2pp (embedding swap) to 31pp (Sonnet with RAG versus full context), but the benchmark's narrow question set limits how far the confound claims travel.

The main takeaway is that this paper documents concrete confounds in agent memory benchmarks through one-at-a-time swaps. Verbatim RAG matches full-context GPT-4o-mini at 47.2% vs 49.8%, yet rankings reverse across Gemini and Sonnet, and swapping only the embedding model moves accuracy by 6.2pp. Those numbers come from controlled runs on the same 500-question set.

What stands out is the protocol itself. By holding the retrieval pipeline fixed and changing one variable, it isolates effects that prior comparisons mixed together. The embedding result and the model-family reversal are new measurements, and the paper reports p-values and sample sizes for them. The cost comparison on two question types (Mem0 matching cloud RAG at 50x cost) is also useful for practitioners.

The soft spot is representativeness. LongMemEval-S uses 500 questions across six types and 50+ sessions, but the abstract gives no detail on how those types were chosen or whether they match real agent workloads. If the questions correlate with the tested models or favor retrieval over self-memory, the advice to always fix embeddings and stratify by model family may not generalize. The finding that self-memory underperforms basic retrieval (42% vs 47%) is interesting but rests on the same benchmark.

This is for people running or designing memory evaluations for agents. It flags a methodological issue with clear examples rather than abstract warnings. The work is coherent on its own terms and shows honest engagement with the evaluation literature.

It deserves peer review so the methods section and data can be checked for selection effects or unstated exclusions.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces MemDelta, a controlled evaluation protocol for agent memory systems that varies one component at a time on the LongMemEval-S benchmark (500 questions, 50+ sessions, three model families). It reports four findings: (1) verbatim RAG matches full-context GPT-4o-mini (47.2% vs. 49.8%, p=0.34) but rankings reverse across models (e.g., Sonnet gains +31pp from RAG, partly due to 63% refusals on full context); (2) swapping only the embedding model shifts accuracy by +6.2pp (p=0.004), with Mem0 beating MiniLM-RAG by +11pp but losing to cloud-RAG by 1.2pp; (3) agent self-memory (42%) underperforms basic retrieval (47%); (4) on 2 of 6 question types (n=88), Mem0 matches cloud RAG (72.7% vs. 73.9%, p=1.0) at 50x cost. The authors recommend fixing embedding models, stratifying by model family, and reporting write-path costs.

Significance. If the controlled empirical results hold, the work is significant for exposing hidden confounds in agent memory evaluations, such as embedding model choice and model-family interactions that can reverse apparent gains from memory architectures. A key strength is the use of one-at-a-time variation with standard statistical tests on reported accuracies, providing clear, falsifiable demonstrations of evaluation pitfalls rather than parameter-fitted derivations.

major comments (2)

[Benchmark and question types description] The central claim that controlled one-at-a-time variation on LongMemEval-S reveals general truths about memory system value and hidden confounds rests on the benchmark's representativeness. The manuscript provides no derivation or validation of the six question types against external agent logs or typical workloads (see the benchmark description), which is load-bearing for generalizing findings (2) and (4) beyond this specific set.
[Results, finding (2)] Finding (2) reports a +6.2pp accuracy shift from embedding model swap alone (p=0.004 at n=500) as evidence of confounds. The methods must explicitly confirm that the retrieval pipeline, prompt, and all other variables were identical across the swap; without this detail, the isolation itself cannot be verified as confound-free.

minor comments (2)

[Abstract] The abstract mentions six question types and specific refusal rates (e.g., 63%) but does not enumerate the types or provide exact counts/conditions; a table or list in the main text would improve clarity and replicability.
A consolidated table summarizing all reported accuracies, p-values, sample sizes, and model comparisons across the four findings would aid quick assessment without requiring cross-referencing the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which help clarify the scope of our claims and the transparency of our methods. We address each major comment below, agreeing where revisions are warranted and providing the strongest honest defense of the manuscript's contributions.

Referee: [Benchmark and question types description] The central claim that controlled one-at-a-time variation on LongMemEval-S reveals general truths about memory system value and hidden confounds rests on the benchmark's representativeness. The manuscript provides no derivation or validation of the six question types against external agent logs or typical workloads (see the benchmark description), which is load-bearing for generalizing findings (2) and (4) beyond this specific set.

Authors: We acknowledge that the manuscript does not derive or externally validate the six question types in LongMemEval-S against real-world agent logs. The paper positions LongMemEval-S as an established benchmark and uses it to demonstrate that even within a fixed benchmark, single-variable changes can reverse conclusions about memory systems. We do not claim the specific numerical results generalize beyond this benchmark. To address the concern, we will add a limitations paragraph in the discussion explicitly qualifying the scope of findings (2) and (4) and noting the absence of external workload validation. revision: partial
Referee: [Results, finding (2)] Finding (2) reports a +6.2pp accuracy shift from embedding model swap alone (p=0.004 at n=500) as evidence of confounds. The methods must explicitly confirm that the retrieval pipeline, prompt, and all other variables were identical across the swap; without this detail, the isolation itself cannot be verified as confound-free.

Authors: The methods already describe performing an embedding-model swap while holding the rest of the pipeline fixed. To make this isolation fully verifiable, we will insert an explicit statement in the experimental setup confirming that retrieval pipeline, prompt templates, chunk size, top-k, and all other variables remained unchanged during the embedding-model comparisons. This matches the actual experimental design. revision: yes
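
The isolation check the referee asks for could be automated from run logs. The sketch below is hypothetical and not taken from the manuscript: it hashes every setting except the one deliberately varied and confirms that two runs agree on the rest. The manifest keys mirror the variables named in the response; their values are placeholders.

```python
import hashlib
import json
from typing import Any, Dict


def fingerprint(manifest: Dict[str, Any], exclude: str) -> str:
    """Hash every setting except the one deliberately varied."""
    kept = {k: v for k, v in manifest.items() if k != exclude}
    blob = json.dumps(kept, sort_keys=True).encode("utf-8")
    return hashlib.sha256(blob).hexdigest()


def is_isolated_swap(run_a: Dict[str, Any], run_b: Dict[str, Any], varied: str) -> bool:
    """True when the runs differ in `varied` and in nothing else."""
    return (
        run_a.get(varied) != run_b.get(varied)
        and fingerprint(run_a, varied) == fingerprint(run_b, varied)
    )


# Placeholder manifests; keys follow the variables listed in the response above.
run_minilm = {
    "embedder": "embedder-a",
    "retrieval_pipeline": "dense-v1",
    "prompt_template": "qa-template-1",
    "chunk_size": 512,
    "top_k": 5,
}
run_cloud = dict(run_minilm, embedder="embedder-b")
run_leaky = dict(run_cloud, top_k=10)  # a second, unintended change

print(is_isolated_swap(run_minilm, run_cloud, "embedder"))
print(is_isolated_swap(run_minilm, run_leaky, "embedder"))
```

Publishing the fingerprint alongside each result would let readers confirm the isolation without access to the full configuration.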

Circularity Check

0 steps flagged

No circularity: purely empirical evaluation with no derivations

The paper reports controlled empirical comparisons of memory systems on the LongMemEval-S benchmark, including accuracy percentages, p-values from statistical tests, and model-specific reversals. No equations, derivations, fitted parameters renamed as predictions, or self-citation chains appear in the abstract or described content. All findings (e.g., verbatim RAG matching full-context at 47.2% vs 49.8%, embedding swap effects) are direct measurements against external benchmarks and standard tests, with no reduction of claims to inputs by construction. The analysis is self-contained.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

Relies on the LongMemEval-S dataset and standard statistical significance testing; no free parameters, invented entities, or ad-hoc axioms introduced beyond domain assumptions of benchmark validity.

axioms (1)

standard math: Statistical tests with reported p-values correctly identify whether observed accuracy differences are due to chance.
Invoked for all p-value statements such as p=0.34 and p=0.004.
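
As a complement to the p-values, the sampling uncertainty of a single reported accuracy at n = 500 can be gauged with a Wilson score interval. This is a standard formula offered as an illustration, not something the paper reports. The count 236 is derived from the stated 47.2% of 500 questions; the function name is hypothetical.

```python
from math import sqrt
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 for about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    return center - half, center + half


# 236 correct out of 500 corresponds to the reported 47.2% accuracy.
low, high = wilson_interval(236, 500)
print(f"{low:.3f} to {high:.3f}")
```

The interval spans several percentage points in each direction, which puts differences of a few points between systems in context. Overlapping intervals are not a substitute for a paired test when both systems answer the same questions.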

pith-pipeline@v0.9.1-grok · 5847 in / 1152 out tokens · 43142 ms · 2026-06-30T06:43:22.172416+00:00 · methodology

Reference graph

Works this paper leans on

14 extracted references · 11 canonical work pages · 5 internal anchors

[1] He, Z., et al. MemoryArena: Benchmarking agent memory in interdependent multi-session agentic tasks. arXiv:2602.16313.
[2] Pollertlam, N. and Kornsuwannawit, W. Beyond the context window: A cost-performance analysis of fact-based memory vs. long-context LLMs for persistent agents. arXiv:2603.04814.
[3] Zhao, Y., et al. AMA-Bench: Evaluating long-horizon memory for agentic applications. arXiv:2602.22769.
[4] Packer, C., et al. MemGPT: Towards LLMs as operating systems. arXiv:2310.08560.
[5] Hu, Y., et al. Memory in the age of AI agents. arXiv:2512.13564.
[6] Mem0: The memory layer for AI agents. https://mem0.ai, 2024–2026. Graphiti: Temporal knowledge graphs for LLM agents. Zep AI, https://github.com/getzep/graphiti, 2024.
[7] Ai, Q., et al. MemoryBench: A benchmark for memory and continual learning in LLM systems. arXiv:2510.17281.
[8] Aksoy, S. G., et al. Semantic needles in document haystacks: Sensitivity testing of LLM-as-a-judge similarity scoring. arXiv:2604.18835.
[9] Enevoldsen, K., et al. MMTEB: Massive multilingual text embedding benchmark. arXiv:2502.13595, 2025. https://doi.org/10.48550/arXiv.2502.13595
[10] Zhou, Y., et al. Beyond chunk-then-embed: A comprehensive taxonomy and evaluation of document chunking strategies for information retrieval. arXiv:2602.16974.
[11] Yang, E. and Wang, D. Benchmark illusion: Disagreement among LLMs and its scientific consequences. arXiv:2602.11898.
[12] Peng, J., et al. NEBULA: Do we evaluate vision-language-action agents correctly? arXiv:2510.16263.
[13] Telemetry disabled. Each instance requires ∼1,000+ LLM API calls during ingestion (one extraction call per session, plus embedding calls). A.3 LLM Judge All accuracy judgments use GPT-4o-mini with a binary prompt: given the ground-truth answer and the model’s response, output YES if the response contains the correct information, NO otherwise. The judge do...
[14] compared Mem0 against long-context LLMs, finding a 33pp accuracy gap on LongMemEval and a cost crossover at approximately 10 turns. However, this comparison does not isolate whether the gap comes from extraction, retrieval, embeddings, or model-specific context behavior. Recent work provides mechanistic insight into why full-context baselines are unstabl...
