Pith · machine review for the scientific record

arxiv: 2605.11814 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: unknown

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords MedMemoryBench · agent memory · personalized healthcare · memory saturation · medical reasoning · long-horizon trajectories · streaming evaluation · AI benchmarking

The pith

MedMemoryBench reveals that mainstream AI agent architectures have severe bottlenecks in complex medical reasoning and noise resilience for personalized healthcare.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedMemoryBench to address the lack of suitable benchmarks for memory mechanisms in high-stakes medical applications. It develops a human-agent collaborative pipeline to synthesize realistic long-horizon trajectories from clinically grounded synthetic patient archetypes, producing around 2,000 sessions and 16,000 turns. The work formalizes memory saturation, where ongoing information influx degrades retrieval and reasoning, and uses a streaming evaluate-while-constructing protocol to test agents dynamically. Comprehensive tests show existing architectures struggle particularly with complex medical reasoning and noise. This matters because current benchmarks focus on casual conversations and do not capture the precision and safety needs of real personalized healthcare agents.

Core claim

MedMemoryBench consists of long-horizon medical trajectories synthesized via a human-agent pipeline from clinically grounded patient archetypes, assessed through a novel streaming evaluation protocol that mirrors production memory accumulation, and demonstrates that mainstream agent memory architectures exhibit severe limitations in complex medical reasoning and noise resilience.

What carries the argument

The streaming evaluate-while-constructing assessment protocol that tests memory dynamically as trajectories are built, paired with the formalization of memory saturation under sustained information influx.
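The evaluate-while-constructing idea can be sketched in a few lines: memory accumulates turn by turn, and probes fire mid-stream rather than after the full trajectory is built. The `MemoryAgent` interface, trajectory format, and probe schedule below are hypothetical stand-ins for illustration, not the paper's actual API.

```python
# Minimal sketch of an "evaluate-while-constructing" streaming protocol.
# MemoryAgent is a toy stand-in: it stores turns verbatim and answers by
# substring match. The real benchmark uses LLM agents and judged answers.

class MemoryAgent:
    def __init__(self):
        self.memory = []

    def ingest(self, turn: str) -> None:
        self.memory.append(turn)

    def answer(self, query: str) -> str:
        # Return the most recent stored turn mentioning the query keyword.
        for turn in reversed(self.memory):
            if query.lower() in turn.lower():
                return turn
        return ""

def stream_evaluate(agent, trajectory, probes):
    """Feed turns one at a time; after each turn, run any probe due at
    that point and score it against the expected answer."""
    results = []
    for i, turn in enumerate(trajectory):
        agent.ingest(turn)              # memory accumulates as in production
        for due_at, query, expected in probes:
            if due_at == i:             # evaluate mid-construction, not at the end
                pred = agent.answer(query)
                results.append((i, query, expected.lower() in pred.lower()))
    return results

trajectory = [
    "Patient reports penicillin allergy.",
    "Started metformin 500 mg daily.",
    "Metformin increased to 1000 mg daily.",
]
probes = [(1, "allergy", "penicillin"), (2, "metformin", "1000 mg")]
print(stream_evaluate(MemoryAgent(), trajectory, probes))
```

A real harness would swap in the benchmark's query templates and an LLM judge; the structural point is only that evaluation interleaves with trajectory construction instead of running once on a finished transcript.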

If this is right

  • Agent memory designs must incorporate mechanisms to detect and mitigate memory saturation under continuous data influx.
  • Evaluations of medical agents should adopt dynamic streaming protocols rather than static snapshot tests.
  • Architectures require targeted improvements in handling complex, multi-turn medical reasoning chains.
  • Production healthcare systems will need specialized memory components beyond those used in open-domain agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark approach could extend to other long-term tracking domains such as chronic disease management outside the initial synthetic archetypes.
  • Specific memory implementations like hierarchical or episodic stores could be isolated and ranked on the saturation metric.
  • Validation against real patient data distributions might reveal additional edge cases not captured in the current archetypes.

Load-bearing premise

The human-agent collaborative pipeline and clinically grounded synthetic patient archetypes produce trajectories that faithfully capture the precision, safety, and long-term tracking demands of real-world personalized healthcare.

What would settle it

Direct comparison of the same agent models on MedMemoryBench trajectories versus anonymized logs from actual clinical deployments. If the real logs showed no matching pattern of degradation in medical reasoning or noise handling, the benchmark's headline findings would be artifacts of its construction; a matching pattern would vindicate the synthetic-fidelity premise.

Figures

Figures reproduced from arXiv: 2605.11814 by Chunxiao Guo, Haoran Xu, Huan Li, Jinjie Gu, Ke Chen, Lidan Shou, Peng Wei, Renjie Gu, Xinyi Chen, Xinyu Mu, Yihao Wang, Yixuan Ye, Yuan Gao.

Figure 1: Four representative reasons illustrating why personalized healthcare places stricter demands on agent memory.
Figure 2: Validation results on LoCoMo under the Efficient and Mixed settings.
Figure 3: Overview of the MedMemoryBench data construction pipeline. The pipeline consists of patient profile …
Figure 4: Overview of the MedMemoryBench evaluation framework. The figure summarizes the streaming …
Figure 5: Performance on MedMemoryBench with different retrieval …
Figure 6: Time and token costs for memory building and query answering across representative memory methods. The values in the figure indicate the highest point.
Figure 7: Human–agent consistency analysis based on confusion matrices across query types evaluated with LLM-as-…
read the original abstract

The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an "evaluate-while-constructing" streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MedMemoryBench, a benchmark for agent memory in personalized healthcare. It describes a human-agent collaborative pipeline that synthesizes long-horizon medical trajectories from clinically grounded synthetic patient archetypes, producing a dataset of approximately 2,000 sessions and 16,000 interaction turns. The work proposes a streaming 'evaluate-while-constructing' protocol to assess dynamic memory accumulation and formalizes the memory saturation phenomenon, where ongoing information influx degrades retrieval and reasoning. Benchmarking of mainstream architectures reveals severe limitations, especially in complex medical reasoning and noise resilience.

Significance. If the synthetic trajectories accurately reflect the precision, safety, and long-term tracking demands of real personalized healthcare, MedMemoryBench would offer a useful foundation for diagnosing architectural weaknesses in medical agents and motivating improvements in memory mechanisms for high-stakes applications. The emphasis on memory saturation as a distinct failure mode is a potentially valuable contribution that could shape future work on robust agent memory.

major comments (2)
  1. [Dataset Synthesis Pipeline] The dataset construction section claims the trajectories are 'highly realistic' and 'clinically grounded' via the human-agent pipeline, yet provides no quantitative external validation (e.g., statistical comparison of information density, error propagation, or longitudinal consistency against de-identified real patient logs). This is load-bearing for the central claim of severe bottlenecks in complex medical reasoning and noise resilience, as unverified synthetic fidelity could make observed saturation and retrieval failures artifacts of benchmark construction rather than intrinsic limits.
  2. [Evaluation Protocol] The evaluation protocol section introduces the 'evaluate-while-constructing' streaming assessment but supplies no specific metrics, error analysis of the synthesis pipeline, or quantitative details on how it captures dynamic memory accumulation, the reported bottlenecks, or noise resilience. Without these, the benchmarking results lack the grounding needed to support the headline findings on architectural limitations.
minor comments (2)
  1. [Methods] Clarify the exact criteria and process for 'expert validation' of the 2,000 sessions in the methods section to improve reproducibility.
  2. [Results] Ensure all figures showing saturation curves include error bars or confidence intervals for the reported degradation in retrieval performance.
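The requested error bars on saturation curves are typically obtained by a percentile bootstrap over per-query correctness. A minimal sketch, assuming binary outcomes per retrieval probe (the paper does not specify its interval procedure, so this is illustrative only):

```python
# Percentile-bootstrap confidence interval for one retrieval-accuracy point
# on a saturation curve. Illustrative; not the paper's actual CI procedure.
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# e.g. 70 correct out of 100 retrieval probes at one session-length bucket
outcomes = [1] * 70 + [0] * 30
lo, hi = bootstrap_ci(outcomes)
print(f"accuracy 0.70, 95% CI [{lo:.2f}, {hi:.2f}]")
```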

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on MedMemoryBench. We address each major comment point-by-point below, providing clarifications where possible and committing to revisions that strengthen the manuscript without overstating the current evidence.

read point-by-point responses
  1. Referee: [Dataset Synthesis Pipeline] The dataset construction section claims the trajectories are 'highly realistic' and 'clinically grounded' via the human-agent pipeline, yet provides no quantitative external validation (e.g., statistical comparison of information density, error propagation, or longitudinal consistency against de-identified real patient logs). This is load-bearing for the central claim of severe bottlenecks in complex medical reasoning and noise resilience, as unverified synthetic fidelity could make observed saturation and retrieval failures artifacts of benchmark construction rather than intrinsic limits.

    Authors: We acknowledge the absence of direct quantitative comparisons to real patient logs. Privacy regulations and institutional data-access policies prevent statistical benchmarking against de-identified real trajectories. The pipeline instead relies on clinically grounded synthetic archetypes co-developed with medical experts, followed by iterative human review in the collaborative loop. In revision we will expand the dataset section with: (i) the number and qualifications of expert reviewers, (ii) inter-rater agreement statistics on clinical fidelity, and (iii) a limitations paragraph explicitly discussing the synthetic-data gap. We maintain that the observed saturation patterns are consistent across five distinct agent architectures, which would be unlikely if the failures were purely artifacts of unverified synthesis. revision: partial

  2. Referee: [Evaluation Protocol] The evaluation protocol section introduces the 'evaluate-while-constructing' streaming assessment but supplies no specific metrics, error analysis of the synthesis pipeline, or quantitative details on how it captures dynamic memory accumulation, the reported bottlenecks, or noise resilience. Without these, the benchmarking results lack the grounding needed to support the headline findings on architectural limitations.

    Authors: We agree that the current description of the streaming protocol is underspecified. In the revised manuscript we will add: (i) explicit metrics (retrieval precision/recall as a function of session length, medical-reasoning accuracy, and noise-resilience scores), (ii) quantitative error analysis of the synthesis pipeline (consistency checks and propagation estimates), and (iii) step-by-step illustrations showing how memory state is evaluated after each turn. These additions will directly link the protocol to the reported saturation and bottleneck findings. revision: yes
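The first metric the rebuttal commits to, retrieval performance as a function of session length, amounts to a bucketed aggregation of per-query hits; a minimal sketch, with a hypothetical record format:

```python
# Sketch of retrieval recall as a function of session length, the curve
# that would expose memory saturation. The (session_index, n_relevant,
# n_retrieved_relevant) record format is a hypothetical stand-in.
from collections import defaultdict

def recall_by_session_length(records, bucket=5):
    """Group sessions into length buckets and compute recall per bucket."""
    agg = defaultdict(lambda: [0, 0])  # bucket -> [hits, relevant]
    for session_idx, n_rel, n_hit in records:
        b = session_idx // bucket
        agg[b][0] += n_hit
        agg[b][1] += n_rel
    return {b: hit / rel for b, (hit, rel) in sorted(agg.items()) if rel}

# Toy data where recall degrades as sessions accumulate (saturation).
records = [(2, 10, 9), (4, 10, 9), (7, 10, 7), (9, 10, 6), (12, 10, 4)]
print(recall_by_session_length(records))
```

A monotonically falling curve over buckets is the signature of saturation; a flat curve would indicate the memory system scales with influx.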

standing simulated objections not resolved
  • Quantitative external validation of synthetic trajectories against real de-identified patient logs (precluded by privacy regulations and data-access restrictions)

Circularity Check

0 steps flagged

No circularity: benchmark construction and evaluations are independent

full rationale

The paper introduces MedMemoryBench through an explicit human-agent collaborative pipeline over synthetic patient archetypes, producing a new dataset of ~2,000 sessions. Benchmarking results on mainstream architectures (including saturation effects) are direct empirical measurements on this freshly constructed data rather than reductions to prior fitted parameters, self-citations, or definitional equivalences. No equations, uniqueness theorems, or ansatzes are invoked that collapse the claimed bottlenecks back to the input construction by construction. The 'evaluate-while-constructing' protocol is presented as a methodological choice mirroring production use, not a tautology with the observed outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that synthetic archetypes and the collaborative pipeline faithfully reproduce clinical complexity; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Synthetic patient archetypes and human-agent pipeline produce trajectories that match real clinical precision and long-term tracking needs
    Invoked to justify the dataset as a proxy for production healthcare agents

pith-pipeline@v0.9.0 · 5541 in / 1140 out tokens · 49829 ms · 2026-05-13T06:30:39.337949+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 14 internal anchors

  1. [1]

    Ant Afu - Your AI Health Companion.https://www.antafu.com, 12 2025

    Ant Group. Ant Afu - Your AI Health Companion.https://www.antafu.com, 12 2025. Accessed: 2026-05-07

  2. [2]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

  3. [3]

    Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

    Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution.arXiv preprint arXiv:2512.10696, 2025. ACL 2026 Findings

  4. [4]

    Halumem: Evaluating hallucinations in memory systems of agents

    Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, and Zhiyu Li. Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

  5. [5]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  6. [6]

    A coefficient of agreement for nominal scales.Educational and psychological measurement, 20(1): 37–46, 1960

    Jacob Cohen. A coefficient of agreement for nominal scales.Educational and psychological measurement, 20(1): 37–46, 1960

  7. [7]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024

  8. [8]

    Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

  9. [9]

    Novel memory forgetting techniques for autonomous ai agents: Balancing relevance and efficiency.arXiv preprint arXiv:2604.02280, 2026

    Payal Fofadiya and Sunil Tiwari. Novel memory forgetting techniques for autonomous ai agents: Balancing relevance and efficiency.arXiv preprint arXiv:2604.02280, 2026

  10. [10]

    Hipporag: Neurobiologically inspired long-term memory for large language models.Advances in neural information processing systems, 37:59532– 59569, 2024

    Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models.Advances in neural information processing systems, 37:59532– 59569, 2024

  11. [11]

    From RAG to memory: Non- parametric continual learning for large language models.CoRR, abs/2502.14802, 2025

    Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From rag to memory: Non-parametric continual learning for large language models, 2025. URLhttps://arxiv.org/abs/2502.14802

  12. [12]

    Memoryarena: Benchmarking agent memory in interde- pendent multi-session agentic tasks.arXiv preprint arXiv:2602.16313, 2026

    Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. arXiv preprint arXiv:2602.16313, 2026

  13. [13]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  14. [14]

    Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

  15. [15]

    Personamem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

    Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

  16. [16]

    Realtalk: A 21-day real-world dataset for long-term conversation.arXiv preprint arXiv:2502.13270, 2025

    Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. Realtalk: A 21-day real-world dataset for long-term conversation.arXiv preprint arXiv:2502.13270, 2025

  17. [17]

    Retrieval-augmented generation for knowledge- intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  18. [18]

    Hello again! LLM-powered personalized agent for long-term dialogue

    Yunfan Li et al. Hello again! LLM-powered personalized agent for long-term dialogue. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

  19. [19]

    Memos: An operating system for memory-augmented generation (mag) in large language models.arXiv preprint arXiv:2505.22101, 2025

    Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, et al. Memos: An operating system for memory-augmented generation (mag) in large language models.arXiv preprint arXiv:2505.22101, 2025

  20. [20]

    Simplemem: Efficient lifelong memory for llm agents

    Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026

  21. [21]

    Evalu- ating very long-term conversational memory of llm agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evalu- ating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  22. [22]

    Memgpt: towards llms as operating systems, 2023

    Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. Memgpt: towards llms as operating systems, 2023

  23. [23]

    Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation

    Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation. InProceedings of the ACM on Web Conference 2025, pages 2366–2377, 2024

  24. [24]

    Memobrain: Executive memory as an agentic brain for reasoning.arXiv preprint arXiv:2601.08079, 2025

    Hongjin Qian, Zhao Cao, and Zheng Liu. Memobrain: Executive memory as an agentic brain for reasoning.arXiv preprint arXiv:2601.08079, 2025

  25. [25]

    Zep: A Temporal Knowledge Graph Architecture for Agent Memory

    Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

  26. [26]

    Healthcare agent: Elic- iting the power of large language models for medical consultation.npj Artificial Intelligence, 1(24), 2025

    Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, Pingbo Xu, and Dacheng Tao. Healthcare agent: Elic- iting the power of large language models for medical consultation.npj Artificial Intelligence, 1(24), 2025. doi:10.1038/s44387-025-00021-x

  27. [27]

    The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

  28. [28]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  29. [29]

    Remem: Reasoning with episodic memory in language agent.arXiv preprint arXiv:2602.13530, 2026

    Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, and Yu Su. Remem: Reasoning with episodic memory in language agent.arXiv preprint arXiv:2602.13530, 2026

  30. [30]

    Membench: Towards more comprehensive evaluation on the memory of llm-based agents

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

  31. [31]

    From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    Md Nayem Uddin, Kumar Shubham, Eduardo Blanco, Chitta Baral, and Gengyu Wang. From recall to forgetting: Benchmarking long-term memory for personalized agents.arXiv preprint arXiv:2604.20006, 2026

  32. [32]

    MIRIX: Multi-Agent Memory System for LLM-Based Agents

    Yu Wang and Xi Chen. Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

  33. [33]

    Mem-α: Learning Memory Construction via Reinforcement Learning

    Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-α: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

  34. [34]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

  35. [35]

    KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

    Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. Knowme-bench: Benchmarking person understanding for lifelong digital companions. arXiv preprint arXiv:2601.04745, 2026

  36. [36]

    An agent-based adaptive medical dialogue service for personalized healthcare.Information Processing & Management, 62(3), 2025

    Fangfang Xu et al. An agent-based adaptive medical dialogue service for personalized healthcare.Information Processing & Management, 62(3), 2025

  37. [37]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

  38. [38]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

  39. [39]

    Memagent: Reshaping long-context llm with multi-conv rl-based memory agent

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

  40. [40]

    Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

    Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

  41. [41]

    Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory.arXiv preprint arXiv:2601.03192, 2026

    Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, et al. Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026

  42. [42]

    Ama-bench: Evaluating long-horizon memory for agentic applications, 2026

    Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, et al. Ama-bench: Evaluating long-horizon memory for agentic applications.arXiv preprint arXiv:2602.22769, 2026

  43. [43]

    Memorybank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024

  44. [44]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

  45. [45]

    MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841, 2025

  46. [46]

    Provide the target entity name directly

  47. [47]

    Keep the answer brief and precise

  48. [48]

    Answer: TheEEMtemplate is designed for precise slot-level retrieval

    Do not include lengthy explanations. Answer: TheEEMtemplate is designed for precise slot-level retrieval. It therefore emphasizes direct extraction of the target entity rather than extended explanation. TLA Answer Prompt Context: Based on <memory_source>, accurately answer the following question. Question: <question> Answer Requirements:

  49. [49]

    If the question asks about a time, answer in YYYY-MM-DD format (e.g., 2024-01-15)

  50. [50]

    19 A preprint

    If the question asks about an event at a specific time, clearly describe the event content and key details. 19 A preprint

  51. [51]

    Keep the answer concise and directly grounded in memory. Answer: TheTLAtemplate explicitly constrains temporal questions to normalized date outputs whenever possible, while still allowing concise event descriptions when the query asks what happened at a particular time point. SUA Answer Prompt Context: Based on <memory_source>, accurately answer the follo...

  52. [52]

    Describe the patient’s most recent status

  53. [53]

    Reflect important changes over time when necessary

  54. [55]

    Be concise and direct. Answer: TheSUAprompt emphasizes up-to-date patient status and trajectory-aware summarization, which is important for questions that ask for the latest condition rather than isolated historical facts. MQ Answer Prompt Context: Based on <memory_source>, and considering the patient’s allergy history, medical history, medications, and p...

  55. [56]

    Select all correct options

  56. [57]

    Output only the option letter(s), such as B or B,D

  57. [58]

    Do not provide any explanation. Answer: TheMQtemplate enforces a strict multiple-choice output format, which simplifies automatic evaluation and avoids verbose justifications that are irrelevant to the benchmark target. IG Answer Prompt Context: Based on <memory_source>, and considering the patient’s allergy history, medical history, medications, and pers...

  58. [59]

    Reason from this patient’s specific remembered information; do not give generic medical advice

  59. [60]

    Maintain a warm yet professional tone

  60. [61]

    Be concise, direct, and avoid boilerplate

  61. [62]

    Answer: TheIGtemplate is designed for personalized medical inference

    If recommending or advising against something, briefly explain the reason based on the patient’s specific situation. Answer: TheIGtemplate is designed for personalized medical inference. It explicitly discourages generic recommendations and instead requires patient-grounded reasoning based on remembered allergy history, medication use, prior diagnoses, an...

  62. [63]

    Clearly list the memory content you draw upon

  63. [64]

    Present a clear reasoning path from evidence to conclusions

  64. [65]

    Provide a final comprehensive judgment. Answer: Finally, the MCD template targets queries that require multi-visit synthesis and explicit reasoning over dispersed historical evidence. Compared with the other templates, it places the strongest emphasis on transparent reasoning paths and comprehensive judgment grounded in multiple memory items.

    D.2 LLM-as-Jud...

  65. [66]

    Asking when a certain event occurred. The model must correctly provide the time point

  66. [67]

    Asking what happened at a certain time. The model must correctly describe the event content. Judge strictly:
    - If the model’s answer contains the correct time point or the correct event content, judge as [CORRECT]
    - If the model’s answer about the time/event does not match the reference answer or fails to answer, judge as [INCORRECT]
    - Date formats do not ...

  67. [68]

    Must be based on memory: The model’s answer must demonstrate the use of the patient’s past memory information, not guessing or generic medical knowledge

  68. [69]

    No guessing allowed: If the model has not retrieved relevant memory information but gives a “coincidentally correct” answer, it should be judged as [INCORRECT]

  69. [70]

    Information source requirement: A correct answer should convey that the model “remembers” this patient’s specific situation, rather than guessing. Judge strictly:
    - If the model’s answer demonstrates the use of patient historical memory and the core content is consistent with the reference answer, judge as [CORRECT]
    - If the model’s answer contains key in...

  70. [71]

    Patient Information Utilization (Key)
    - The model must demonstrate the use of patient-specific information from memory
    - If required_patient_info is provided in metadata, the model’s answer must reflect understanding of these key pieces of information (important)
    - If the patient’s specific circumstances and past memories are ignored or missing, judge as ...
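A minimal sketch of the `required_patient_info` check, assuming that metadata field is a list of strings. Case-insensitive substring matching is a deterministic stand-in for the LLM judge; it illustrates the rubric's intent, not the actual implementation.

```python
def covers_required_info(answer: str, required_patient_info: list[str]) -> bool:
    """Check whether a model answer reflects every required piece of
    patient information listed in the question metadata.

    Substring matching only sketches the rubric; the benchmark's judge
    is an LLM and can credit paraphrases that this check would miss.
    """
    answer_lower = answer.lower()
    return all(info.lower() in answer_lower for info in required_patient_info)
```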

  71. [72]

    Reasoning Quality
    - The model must reason based on retrieved patient historical information, not purely from its own medical common sense
    - If only a conclusion is given without sufficient reference to patient information and memory, judge as [INCORRECT]
    - If the model gives a “common wrong answer” type of response (generic advice), judge as [INCORRECT]

  72. [73]

    Conclusion Correctness
    - The final recommendation/conclusion should be fully consistent with the reference answer in direction
    - Even if the conclusion is correct, if it lacks reasoning based on patient information, still judge as [INCORRECT]
    Judgment rules:
    - [CORRECT]: Answer uses patient-specific information, contains required patient information point...

  73. [74]

    Patient-Specific Information Principle: The model must explicitly reference the patient’s specific data (such as specific test values, medication dosages, specific timing of symptom onset, particular diagnostic results), rather than giving generic medical common sense.
    - “Poor blood sugar control may lead to...”. This is generic medical knowledge, not pat...

  74. [75]

    Memory Retrieval Evidence Principle: If the model fails to demonstrate specific references to the patient’s historical records, even if the reasoning direction is correct, it should be judged as inadequate. The model must show it “remembers” the patient’s specific situation

  75. [76]

    Strict Causal Chain Correspondence Principle: The causal relationships established by the model must precisely correspond to the causal mechanisms described in the reasoning chain nodes. Similar but different mechanisms cannot substitute for one another.

  76. [77]

    Node Content Precise Matching Principle: During node verification, it is not sufficient to judge as “covered” merely because the model mentioned a related concept. You must verify whether the model referenced the core specific content within the node.
    Evaluation Steps
    Step 1: Strict Node-by-Node Check
    For each node in the reasoning chain, all of the follo...