
arxiv: 2606.06448 · v1 · pith:7Y7Q3UROnew · submitted 2026-06-04 · 💻 cs.AI

Agent Memory: Characterization and System Implications of Stateful Long-Horizon Workloads

Pith reviewed 2026-06-28 01:07 UTC · model grok-4.3

classification 💻 cs.AI
keywords agent memory · LLM agents · systems characterization · long-horizon tasks · memory profiling · cost tradeoffs · stateful workloads · benchmark suites

The pith

Phase-aware profiling of ten agent memory systems shows that design choices shift cost between the write and read paths; the authors distill the measurements into ten system recommendations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the first systems characterization of agent memory for LLM agents on long-horizon tasks that require persistent storage and retrieval across sessions. It introduces a taxonomy that classifies systems along four axes and a profiling harness that attributes costs to the distinct phases of construction, retrieval, and generation. Characterization across ten representative systems and two benchmark suites reveals how specific design choices move costs from the write path to the read path and back. This matters because scaling stateful agents depends on understanding and controlling these phase-specific costs rather than treating memory as a uniform black box. The work ends by extracting ten recommendations that address construction scheduling, capability floors, query-volume amortization, freshness-latency balances, and fleet-scale management.

Core claim

By classifying agent memory systems along four axes and applying a phase-aware profiling harness that isolates costs in construction, retrieval, and generation, the study of ten representative systems on two benchmarks demonstrates that design choices systematically shift cost between the write and read paths, from which ten concrete system recommendations follow.

What carries the argument

A system-oriented taxonomy along four axes combined with a phase-aware profiling harness that attributes cost to construction, retrieval, and generation phases.
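
One way to picture the phase-aware harness is as a timer that charges every operation to whichever phase is currently active. The sketch below is illustrative only: `PhaseProfiler`, its methods, and the stand-in workload are hypothetical names, not the authors' harness. It tracks wall time and call counts; the excerpt beside Figure 2 suggests the paper's harness also tags token counts and hardware telemetry, which are omitted here.

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# The three phases named in the paper's abstract.
PHASES = ("construction", "retrieval", "generation")


class PhaseProfiler:
    """Hypothetical harness: attributes wall time and call counts to phases."""

    def __init__(self):
        self.seconds = defaultdict(float)
        self.calls = defaultdict(int)

    @contextmanager
    def phase(self, name):
        """Charge everything executed inside the `with` block to `name`."""
        if name not in PHASES:
            raise ValueError(f"unknown phase: {name}")
        start = time.perf_counter()
        try:
            yield
        finally:
            self.seconds[name] += time.perf_counter() - start
            self.calls[name] += 1

    def shares(self):
        """Fraction of total measured wall time spent in each phase."""
        total = sum(self.seconds.values())
        if total == 0:
            return {p: 0.0 for p in PHASES}
        return {p: self.seconds[p] / total for p in PHASES}


# Stand-in workload (assumption): real systems would call an LLM,
# an embedding model, or an index here.
profiler = PhaseProfiler()
with profiler.phase("construction"):
    index = {i: f"turn {i}" for i in range(1000)}
with profiler.phase("retrieval"):
    hits = [index[i] for i in (3, 14, 15)]
with profiler.phase("generation"):
    answer = " | ".join(hits)
```

Calling `profiler.shares()` afterwards gives the per-phase split of wall time, which is the kind of breakdown the write-path versus read-path comparison relies on.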

If this is right

  • Construction scheduling can be adjusted to lower total system cost.
  • Capability floors set minimum performance requirements for deployed memory.
  • Higher query volumes amortize fixed construction costs across more operations.
  • Freshness-latency tradeoffs must be explicitly managed in system design.
  • Fleet-scale management becomes necessary once many agents share memory resources.
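
The amortization point can be made concrete with a little arithmetic. The sketch below assumes a simple linear cost model (one-time construction cost plus a constant per-query serving cost); the model, the function names, and the numbers in the usage note are illustrative assumptions, not figures from the paper.

```python
import math


def per_query_cost(construction_cost, serve_cost, num_queries):
    """Effective cost per query once construction is spread over num_queries."""
    if num_queries <= 0:
        raise ValueError("num_queries must be positive")
    return construction_cost / num_queries + serve_cost


def break_even_queries(heavy_construct, heavy_serve, light_construct, light_serve):
    """Smallest query count at which the construction-heavy system's total
    cost is no higher than the construction-light system's.

    Returns None when the inputs are outside the trade-off regime this
    sketch models (the heavy system must cost at least as much to build
    and strictly less per query).
    """
    extra_construction = heavy_construct - light_construct
    serve_saving = light_serve - heavy_serve
    if extra_construction < 0 or serve_saving <= 0:
        return None
    return max(1, math.ceil(extra_construction / serve_saving))
```

With hypothetical costs of 1000 units to build and 1 unit per query for a construction-heavy system, against 100 units to build and 4 units per query for a lighter one, `break_even_queries(1000, 1, 100, 4)` gives 300 queries. Below that volume the lighter design is cheaper overall, which is the sense in which low-volume and high-volume workloads may call for different designs.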

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The four-axis taxonomy could be used to classify new memory systems as they appear without re-running the full profiling harness.
  • The observed write-read cost shifts may guide hardware or storage-layer choices for hosting long-running agents.
  • Recommendations on amortization imply that low-volume agent workloads may require different designs than high-volume ones.
  • Extending the profiling harness to measure end-to-end task success rather than isolated phase costs could connect system metrics to agent capability.

Load-bearing premise

The ten chosen systems and two benchmark suites are sufficient to expose the dominant cost tradeoffs across the wider space of agent memory designs.

What would settle it

Profiling a substantially larger collection of systems or additional benchmarks and finding cost-shift patterns that diverge from those observed in the original ten systems would falsify the claimed generality of the characterization.

Figures

Figures reproduced from arXiv: 2606.06448 by Alex Pentland, Marian Verhelst, Robin Geens, Thierry Tambe, Tsachy Weissman, Yasmine Omri, Zachary Broveak, Zexue He, Ziyu Gan.

Figure 1
Agent memory gives rise to short-term working memory and long-term memory: the agent retrieves relevant long-term state into its active context, updates memory after interaction, and maintains stored knowledge over time.
Figure 2
Long-context prompting vs. external agent memory. Per-query serving latency (retrieval + generation, excluding construction) on LongMemEval_S_*. Remote construction via OpenAI API, local construction via vLLM.
… completion tokens, embedding input tokens, and number of embedded sequences. Calls are tagged with the active phase and the chunk, window, turn, or query index that triggered them. Hardware telemetr…
Figure 4
Energy per correct answer. Normalizing total energy by correct answers jointly prices construction and serving against task quality. The spread across agent memory systems exceeds 47×.
… a question we address in Sec. 4.6. This section quantifies the energy component, which is the unhideable cost the operator pays to deploy each agent memory system. All experiments presented in this section were measured and …
Figure 3
Phase cost breakdown. Paradigm III and IV agent memory systems shift the majority of end-to-end energy into construction, which is invisible to the user at query time.
TABLE 3: End-to-end cost summary on LongMemEval (Qwen3-32B, n = 300 queries). Construct + 300 QA.
Agent memory system   Acc.   Wall    Calls   Total kJ   J/correct
BM25                  47.0   16.3m   300     582        4,128
GraphRAG              46.0   1.83h   3,215   2,082      15,084
HippoRAG v2           44.3   44…
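
The J/correct column in Table 3 appears to follow directly from the other columns: total energy divided by the number of correctly answered queries. The sketch below checks that reading against the two complete rows. It assumes that Acc. is a percentage of the n = 300 queries; the function name is hypothetical.

```python
def joules_per_correct(total_kj, accuracy_pct, num_queries):
    """Total energy (kJ) normalized by the number of correct answers, in joules.

    Assumes accuracy_pct is a percentage of num_queries (an assumption
    about the Acc. column, not something the extract states outright).
    """
    correct = round(accuracy_pct / 100 * num_queries)
    if correct == 0:
        raise ValueError("no correct answers to normalize by")
    return total_kj * 1000 / correct


# Rows taken from the Table 3 extract above.
bm25 = joules_per_correct(582, 47.0, 300)       # 141 correct answers
graphrag = joules_per_correct(2082, 46.0, 300)  # 138 correct answers
```

Both values land within a few joules of the tabulated 4,128 and 15,084; the small gap is consistent with the Total kJ column having been rounded before publication.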
Figure 5
Construction call structure and token decomposition. Embedding batching ratios split sharply by taxonomy paradigm: Paradigm III.a generates large-batch offline-indexing traffic; Paradigms III.b and IV generate sequential per-event traffic on the write-loop critical path.
The construction cost identified in Sec. 4.2 has a specific computational shape that follows from how each paradigm transforms interacti…
Figure 6
Construction-LLM sensitivity. QA LLM is fixed to GPT-4o-mini. Embed is fixed to Text Embedding 3 Small. Construction LLM is swept for LLM-dependent systems.
… well-formed JSON schemas and legal tool-call syntax; a model that cannot reliably satisfy these contracts produces a corrupted store from which the QA model can recover no useful evidence. Insight 4: Construction-LLM downscaling is available for most s…
Figure 7
(a) Construction–serve–accuracy frontier. Fast construction, fast per-query serving, and high accuracy cannot be jointly maximized. Accuracy and cost are macro-averaged over all MemoryAgentBench datasets (remote serving, GPT-4o-mini, text-embedding-3-small). (b) Performance of agent memory systems on various task categories in the MemoryAgentBench suite.
Figure 8
Construction scheduling and per-session write latency under session arrivals (MemoryArena, physics split, 20 multi-session tasks). Under asynchronous scheduling, slow-construction agent memory systems serve queries against stale memory because prior session writes have not yet committed. Per-session write latency spans five orders of magnitude, from sub-millisecond for BM25 to tens of seconds for Paradigm …
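
The staleness effect in the Figure 8 caption can be illustrated with a toy queueing model. Everything below is an assumption made for illustration: a single background writer that processes session writes in arrival order, a fixed write latency per session, and writes enqueued at session arrival. It is not the paper's scheduler, and `stale_sessions` is a hypothetical name.

```python
def stale_sessions(arrival_times, write_latency):
    """Count sessions that start before the previous session's write commits.

    Toy model (assumption): one background writer, FIFO order, constant
    write latency, and each session's write is enqueued at its arrival.
    """
    writer_free_at = 0.0
    previous_commit = None
    stale = 0
    for arrival in arrival_times:
        # A session reads stale memory if the prior write is still pending.
        if previous_commit is not None and previous_commit > arrival:
            stale += 1
        start = max(arrival, writer_free_at)
        writer_free_at = start + write_latency
        previous_commit = writer_free_at
    return stale
```

With sessions arriving every 10 seconds, a sub-millisecond write latency leaves no session stale, whereas a 25-second write latency leaves every session after the first reading memory that lags behind, and the backlog grows with each arrival.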
Figure 10
Effective time-to-first-token. Pre-answer latency spans two orders of magnitude on identical hardware. The dominant variable is retrieval pipeline depth, not LLM serving speed.
… compaction or summarization policies to prevent unbounded cost escalation. Because all evaluated systems accumulate state monotonically by default, operators must add independent pruning or forgetting policies to bound fleet-scale…
Figure 11
QA tail latency by agent memory system. Fixed-depth pipelines (Paradigm II) have p95/p50 near 1.3×. Systems with more complex store structures or query-adaptive pipelines display wider tails at QA time.
… reasoning, tool-use, or refinement steps. External iteration caps or timeouts are therefore required to bound worst-case cost. Recommendation 10: Latency-sensitive deployments should treat worst-case laten…
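
The p95/p50 ratio quoted for Figure 11 is straightforward to compute from raw per-query latencies. The sketch below uses the nearest-rank percentile definition, which is an assumption: the extract does not say which percentile method the authors used, and the function names are hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: smallest value with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("samples must be non-empty")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def tail_ratio(latencies):
    """Ratio of 95th-percentile to median latency; values near 1 mean a tight tail."""
    return percentile(latencies, 95) / percentile(latencies, 50)
```

A ratio near 1.3, as reported for the fixed-depth pipelines, means the slow queries take only about 30% longer than the typical one; pipelines with query-dependent depth would show a larger ratio on the same measurement.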
Original abstract

LLM agents are increasingly deployed on long-horizon tasks requiring sustained reasoning over extended interaction histories. Realizing this at scale requires agents to persistently store, retrieve, and update their own memory across sessions. A rich ecosystem of agent memory systems has emerged spanning flat retrieval, LLM-mediated extraction, consolidating fact stores, and agentic control flows. Yet, their system-level behavior remains uncharacterized. We present the first systems characterization of agent memory. First, we introduce a system-oriented taxonomy classifying agent memory systems along four axes. Second, we build a phase-aware profiling harness attributing cost to construction, retrieval, and generation. Third, we characterize ten representative systems across two benchmark suites, uncovering how design choices shift cost across the write and read paths. Finally, we derive 10 system recommendations covering construction scheduling, capability floors, amortization via query volume, freshness-latency tradeoffs, and fleet-scale management.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents the first systems characterization of agent memory for LLM agents on long-horizon tasks. It introduces a four-axis taxonomy for classifying agent memory systems, develops a phase-aware profiling harness to attribute costs to construction, retrieval, and generation phases, empirically evaluates ten representative systems across two benchmark suites to reveal cost shifts between write and read paths, and derives ten system recommendations regarding construction scheduling, capability floors, amortization, freshness-latency tradeoffs, and fleet-scale management.

Significance. If the empirical findings hold, the work provides timely and actionable insights into an emerging area of LLM agent infrastructure where persistent memory is required for sustained reasoning. The phase-aware cost attribution and derivation of concrete recommendations represent strengths that could directly influence system design choices. The empirical measurement study approach is appropriate for the claims made.

major comments (2)
  1. [Section 5] Section 5 (Characterization of ten systems): The selection of the ten representative systems is presented without an explicit coverage analysis or justification relative to the four-axis taxonomy from Section 3. If the chosen systems do not adequately sample quadrants involving high-frequency updates or agentic control flows, the observed cost shifts across write/read paths and the generalization to ten system recommendations cannot be treated as dominant for the broader design space.
  2. [Section 5.2] Section 5.2 (Benchmark suites): The two benchmark suites are used to drive the characterization, but the manuscript provides no discussion of how these suites were selected or their coverage of long-horizon task distributions relative to other potential workloads. This is load-bearing for claims about general system implications and cost tradeoffs.
minor comments (2)
  1. [Abstract] The abstract and introduction could more explicitly name the two benchmark suites and the four taxonomy axes to improve readability for readers unfamiliar with the agent memory ecosystem.
  2. Figure captions and axis labels in the cost attribution plots should include units and error bar definitions for clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments. We address each major comment below and will revise the manuscript to incorporate explicit justifications as outlined.

Point-by-point responses
  1. Referee: [Section 5] Section 5 (Characterization of ten systems): The selection of the ten representative systems is presented without an explicit coverage analysis or justification relative to the four-axis taxonomy from Section 3. If the chosen systems do not adequately sample quadrants involving high-frequency updates or agentic control flows, the observed cost shifts across write/read paths and the generalization to ten system recommendations cannot be treated as dominant for the broader design space.

    Authors: We agree that an explicit coverage analysis was not provided in Section 5. The ten systems were selected to span the design space outlined in the four-axis taxonomy of Section 3, but we will add a dedicated table and accompanying text in the revision that maps each system to the axes. This will explicitly demonstrate coverage, including of high-frequency update and agentic control flow quadrants, thereby supporting the generalizability of the cost-shift observations and recommendations.
    Revision: yes

  2. Referee: [Section 5.2] Section 5.2 (Benchmark suites): The two benchmark suites are used to drive the characterization, but the manuscript provides no discussion of how these suites were selected or their coverage of long-horizon task distributions relative to other potential workloads. This is load-bearing for claims about general system implications and cost tradeoffs.

    Authors: We acknowledge that Section 5.2 lacks explicit discussion of benchmark selection and coverage. The suites were chosen for their established use in long-horizon agent tasks with varying interaction lengths and complexities. In revision we will expand the section with a rationale for their selection, summary statistics on task distributions (e.g., horizon lengths), and comparison to other potential workloads to clarify the scope of the reported system implications.
    Revision: yes

Circularity Check

0 steps flagged

Empirical measurement study with no self-referential derivations or fitted predictions

Full rationale

The paper introduces a taxonomy, builds a profiling harness, measures ten systems on two benchmarks, and derives recommendations from those measurements. No equations, parameter fits presented as predictions, or self-citations are used to justify central claims. The work is self-contained as an empirical characterization; representativeness of the sample is a validity concern but does not create circularity by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only empirical characterization paper; contains no mathematical derivations, fitted parameters, or postulated entities.

pith-pipeline@v0.9.1-grok · 5711 in / 946 out tokens · 22937 ms · 2026-06-28T01:07:41.162825+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Memory as a Wasting Asset: Pricing Flash Endurance for Embodied Agents, and the Limits of Doing So

    cs.AI 2026-06 unverdicted novelty 6.0

    Flash endurance is priced via shadow price η making placement cost-optimal for any sign of value-write correlation χ, with χ positive only in recurrent long-horizon manipulation and the budget binding only on low-endu...
