Pith · machine review for the scientific record

arxiv: 2605.11814 · v1 · submitted 2026-05-12 · 💻 cs.AI

Recognition: unknown

MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 06:30 UTC · model grok-4.3

classification 💻 cs.AI
keywords MedMemoryBench · agent memory · personalized healthcare · memory saturation · medical reasoning · long-horizon trajectories · streaming evaluation · AI benchmarking

The pith

MedMemoryBench reveals that mainstream AI agent architectures have severe bottlenecks in complex medical reasoning and noise resilience for personalized healthcare.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces MedMemoryBench to address the lack of suitable benchmarks for memory mechanisms in high-stakes medical applications. It develops a human-agent collaborative pipeline to synthesize realistic long-horizon trajectories from clinically grounded synthetic patient archetypes, producing around 2,000 sessions and 16,000 turns. The work formalizes memory saturation, where ongoing information influx degrades retrieval and reasoning, and uses a streaming evaluate-while-constructing protocol to test agents dynamically. Comprehensive tests show existing architectures struggle particularly with complex medical reasoning and noise. This matters because current benchmarks focus on casual conversations and do not capture the precision and safety needs of real personalized healthcare agents.

Core claim

MedMemoryBench consists of long-horizon medical trajectories synthesized via a human-agent pipeline from clinically grounded patient archetypes, assessed through a novel streaming evaluation protocol that mirrors production memory accumulation, and demonstrates that mainstream agent memory architectures exhibit severe limitations in complex medical reasoning and noise resilience.

What carries the argument

The streaming evaluate-while-constructing assessment protocol that tests memory dynamically as trajectories are built, paired with the formalization of memory saturation under sustained information influx.
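The evaluate-while-constructing idea can be sketched in a few lines: memory accumulates turn by turn, and probes fire mid-stream rather than after the full trajectory is built. The `MemoryAgent` interface, trajectory format, and probe schedule below are hypothetical stand-ins for illustration, not the paper's actual API.

```python
# Minimal sketch of an "evaluate-while-constructing" streaming protocol.
# MemoryAgent is a toy stand-in: it stores turns verbatim and answers by
# substring match. The real benchmark uses LLM agents and judged answers.

class MemoryAgent:
    def __init__(self):
        self.memory = []

    def ingest(self, turn: str) -> None:
        self.memory.append(turn)

    def answer(self, query: str) -> str:
        # Return the most recent stored turn mentioning the query keyword.
        for turn in reversed(self.memory):
            if query.lower() in turn.lower():
                return turn
        return ""

def stream_evaluate(agent, trajectory, probes):
    """Feed turns one at a time; after each turn, run any probe due at
    that point and score it against the expected answer."""
    results = []
    for i, turn in enumerate(trajectory):
        agent.ingest(turn)              # memory accumulates as in production
        for due_at, query, expected in probes:
            if due_at == i:             # evaluate mid-construction, not at the end
                pred = agent.answer(query)
                results.append((i, query, expected.lower() in pred.lower()))
    return results

trajectory = [
    "Patient reports penicillin allergy.",
    "Started metformin 500 mg daily.",
    "Metformin increased to 1000 mg daily.",
]
probes = [(1, "allergy", "penicillin"), (2, "metformin", "1000 mg")]
print(stream_evaluate(MemoryAgent(), trajectory, probes))
```

A real harness would swap in the benchmark's query templates and an LLM judge; the structural point is only that evaluation interleaves with trajectory construction instead of running once on a finished transcript.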

If this is right

  • Agent memory designs must incorporate mechanisms to detect and mitigate memory saturation under continuous data influx.
  • Evaluations of medical agents should adopt dynamic streaming protocols rather than static snapshot tests.
  • Architectures require targeted improvements in handling complex, multi-turn medical reasoning chains.
  • Production healthcare systems will need specialized memory components beyond those used in open-domain agents.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The benchmark approach could extend to other long-term tracking domains such as chronic disease management outside the initial synthetic archetypes.
  • Specific memory implementations like hierarchical or episodic stores could be isolated and ranked on the saturation metric.
  • Validation against real patient data distributions might reveal additional edge cases not captured in the current archetypes.

Load-bearing premise

The human-agent collaborative pipeline and clinically grounded synthetic patient archetypes produce trajectories that faithfully capture the precision, safety, and long-term tracking demands of real-world personalized healthcare.

What would settle it

Direct comparison of the same agent models on MedMemoryBench trajectories versus anonymized logs from actual clinical deployments. If the real logs showed no matching pattern of degradation in medical reasoning or noise handling, the benchmark's headline findings would be artifacts of its construction; a matching pattern would vindicate the synthetic-fidelity premise.

Figures

Figures reproduced from arXiv: 2605.11814 by Chunxiao Guo, Haoran Xu, Huan Li, Jinjie Gu, Ke Chen, Lidan Shou, Peng Wei, Renjie Gu, Xinyi Chen, Xinyu Mu, Yihao Wang, Yixuan Ye, Yuan Gao.

Figure 1: Four representative reasons illustrating why personalized healthcare places stricter demands on agent memory.
Figure 2: Validation results on LoCoMo under the Efficient and Mixed settings.
Figure 3: Overview of the MedMemoryBench data construction pipeline. The pipeline consists of patient profile …
Figure 4: Overview of the MedMemoryBench evaluation framework. The figure summarizes the streaming …
Figure 5: Performance on MedMemoryBench with different retrieval …
Figure 6: Time and token costs for memory building and query answering across representative memory methods. The values in the figure indicate the highest point.
Figure 7: Human–agent consistency analysis based on confusion matrices across query types evaluated with LLM-as-…
read the original abstract

The large-scale deployment of personalized healthcare agents demands memory mechanisms that are exceptionally precise, safe, and capable of long-term clinical tracking. However, existing benchmarks primarily focus on daily open-domain conversations, failing to capture the high-stakes complexity of real-world medical applications. Motivated by the stringent production requirements of an industry-leading health management agent serving tens of millions of active users, we introduce MedMemoryBench. We develop a human-agent collaborative pipeline to synthesize highly realistic, long-horizon medical trajectories based on clinically grounded, synthetic patient archetypes. This process yields a massive, expertly validated dataset comprising approximately 2,000 sessions and 16,000 interaction turns. Crucially, MedMemoryBench departs from traditional static evaluations by pioneering an "evaluate-while-constructing" streaming assessment protocol, which precisely mirrors dynamic memory accumulation in production environments. Furthermore, we formalize and systematically investigate the critical phenomenon of memory saturation, where sustained information influx actively degrades retrieval and reasoning robustness. Comprehensive benchmarking reveals severe bottlenecks in mainstream architectures, particularly concerning complex medical reasoning and noise resilience. By exposing these fundamental flaws, MedMemoryBench establishes a vital foundation for developing robust, production-ready medical agents.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces MedMemoryBench, a benchmark for agent memory in personalized healthcare. It describes a human-agent collaborative pipeline that synthesizes long-horizon medical trajectories from clinically grounded synthetic patient archetypes, producing a dataset of approximately 2,000 sessions and 16,000 interaction turns. The work proposes a streaming 'evaluate-while-constructing' protocol to assess dynamic memory accumulation and formalizes the memory saturation phenomenon, where ongoing information influx degrades retrieval and reasoning. Benchmarking of mainstream architectures reveals severe limitations, especially in complex medical reasoning and noise resilience.

Significance. If the synthetic trajectories accurately reflect the precision, safety, and long-term tracking demands of real personalized healthcare, MedMemoryBench would offer a useful foundation for diagnosing architectural weaknesses in medical agents and motivating improvements in memory mechanisms for high-stakes applications. The emphasis on memory saturation as a distinct failure mode is a potentially valuable contribution that could shape future work on robust agent memory.

major comments (2)
  1. [Dataset Synthesis Pipeline] The dataset construction section claims the trajectories are 'highly realistic' and 'clinically grounded' via the human-agent pipeline, yet provides no quantitative external validation (e.g., statistical comparison of information density, error propagation, or longitudinal consistency against de-identified real patient logs). This is load-bearing for the central claim of severe bottlenecks in complex medical reasoning and noise resilience, as unverified synthetic fidelity could make observed saturation and retrieval failures artifacts of benchmark construction rather than intrinsic limits.
  2. [Evaluation Protocol] The evaluation protocol section introduces the 'evaluate-while-constructing' streaming assessment but supplies no specific metrics, error analysis of the synthesis pipeline, or quantitative details on how it captures dynamic memory accumulation, the reported bottlenecks, or noise resilience. Without these, the benchmarking results lack the grounding needed to support the headline findings on architectural limitations.
minor comments (2)
  1. [Methods] Clarify the exact criteria and process for 'expert validation' of the 2,000 sessions in the methods section to improve reproducibility.
  2. [Results] Ensure all figures showing saturation curves include error bars or confidence intervals for the reported degradation in retrieval performance.
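The requested error bars on saturation curves are typically obtained by a percentile bootstrap over per-query correctness. A minimal sketch, assuming binary outcomes per retrieval probe (the paper does not specify its interval procedure, so this is illustrative only):

```python
# Percentile-bootstrap confidence interval for one retrieval-accuracy point
# on a saturation curve. Illustrative; not the paper's actual CI procedure.
import random

def bootstrap_ci(outcomes, n_boot=2000, alpha=0.05, seed=0):
    """95% percentile bootstrap CI for the mean of 0/1 outcomes."""
    rng = random.Random(seed)
    means = sorted(
        sum(rng.choices(outcomes, k=len(outcomes))) / len(outcomes)
        for _ in range(n_boot)
    )
    lo = means[int(alpha / 2 * n_boot)]
    hi = means[int((1 - alpha / 2) * n_boot) - 1]
    return lo, hi

# e.g. 70 correct out of 100 retrieval probes at one session-length bucket
outcomes = [1] * 70 + [0] * 30
lo, hi = bootstrap_ci(outcomes)
print(f"accuracy 0.70, 95% CI [{lo:.2f}, {hi:.2f}]")
```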

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive and detailed feedback on MedMemoryBench. We address each major comment point-by-point below, providing clarifications where possible and committing to revisions that strengthen the manuscript without overstating the current evidence.

read point-by-point responses
  1. Referee: [Dataset Synthesis Pipeline] The dataset construction section claims the trajectories are 'highly realistic' and 'clinically grounded' via the human-agent pipeline, yet provides no quantitative external validation (e.g., statistical comparison of information density, error propagation, or longitudinal consistency against de-identified real patient logs). This is load-bearing for the central claim of severe bottlenecks in complex medical reasoning and noise resilience, as unverified synthetic fidelity could make observed saturation and retrieval failures artifacts of benchmark construction rather than intrinsic limits.

    Authors: We acknowledge the absence of direct quantitative comparisons to real patient logs. Privacy regulations and institutional data-access policies prevent statistical benchmarking against de-identified real trajectories. The pipeline instead relies on clinically grounded synthetic archetypes co-developed with medical experts, followed by iterative human review in the collaborative loop. In revision we will expand the dataset section with: (i) the number and qualifications of expert reviewers, (ii) inter-rater agreement statistics on clinical fidelity, and (iii) a limitations paragraph explicitly discussing the synthetic-data gap. We maintain that the observed saturation patterns are consistent across five distinct agent architectures, which would be unlikely if the failures were purely artifacts of unverified synthesis. revision: partial

  2. Referee: [Evaluation Protocol] The evaluation protocol section introduces the 'evaluate-while-constructing' streaming assessment but supplies no specific metrics, error analysis of the synthesis pipeline, or quantitative details on how it captures dynamic memory accumulation, the reported bottlenecks, or noise resilience. Without these, the benchmarking results lack the grounding needed to support the headline findings on architectural limitations.

    Authors: We agree that the current description of the streaming protocol is underspecified. In the revised manuscript we will add: (i) explicit metrics (retrieval precision/recall as a function of session length, medical-reasoning accuracy, and noise-resilience scores), (ii) quantitative error analysis of the synthesis pipeline (consistency checks and propagation estimates), and (iii) step-by-step illustrations showing how memory state is evaluated after each turn. These additions will directly link the protocol to the reported saturation and bottleneck findings. revision: yes
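The first metric the rebuttal commits to, retrieval performance as a function of session length, amounts to a bucketed aggregation of per-query hits; a minimal sketch, with a hypothetical record format:

```python
# Sketch of retrieval recall as a function of session length, the curve
# that would expose memory saturation. The (session_index, n_relevant,
# n_retrieved_relevant) record format is a hypothetical stand-in.
from collections import defaultdict

def recall_by_session_length(records, bucket=5):
    """Group sessions into length buckets and compute recall per bucket."""
    agg = defaultdict(lambda: [0, 0])  # bucket -> [hits, relevant]
    for session_idx, n_rel, n_hit in records:
        b = session_idx // bucket
        agg[b][0] += n_hit
        agg[b][1] += n_rel
    return {b: hit / rel for b, (hit, rel) in sorted(agg.items()) if rel}

# Toy data where recall degrades as sessions accumulate (saturation).
records = [(2, 10, 9), (4, 10, 9), (7, 10, 7), (9, 10, 6), (12, 10, 4)]
print(recall_by_session_length(records))
```

A monotonically falling curve over buckets is the signature of saturation; a flat curve would indicate the memory system scales with influx.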

standing simulated objections not resolved
  • Quantitative external validation of synthetic trajectories against real de-identified patient logs (precluded by privacy regulations and data-access restrictions)

Circularity Check

0 steps flagged

No circularity: benchmark construction and evaluations are independent

full rationale

The paper introduces MedMemoryBench through an explicit human-agent collaborative pipeline over synthetic patient archetypes, producing a new dataset of ~2,000 sessions. Benchmarking results on mainstream architectures (including saturation effects) are direct empirical measurements on this freshly constructed data rather than reductions to prior fitted parameters, self-citations, or definitional equivalences. No equations, uniqueness theorems, or ansatzes are invoked that collapse the claimed bottlenecks back to the input construction by construction. The 'evaluate-while-constructing' protocol is presented as a methodological choice mirroring production use, not a tautology with the observed outcomes.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim depends on the domain assumption that synthetic archetypes and the collaborative pipeline faithfully reproduce clinical complexity; no free parameters or invented entities are introduced.

axioms (1)
  • domain assumption Synthetic patient archetypes and human-agent pipeline produce trajectories that match real clinical precision and long-term tracking needs
    Invoked to justify the dataset as a proxy for production healthcare agents

pith-pipeline@v0.9.0 · 5541 in / 1140 out tokens · 49829 ms · 2026-05-13T06:30:39.337949+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · 14 internal anchors

  1. [1]

    Ant Afu - Your AI Health Companion.https://www.antafu.com, 12 2025

    Ant Group. Ant Afu - Your AI Health Companion.https://www.antafu.com, 12 2025. Accessed: 2026-05-07

  2. [2]

    Longbench: A bilingual, multitask benchmark for long context understanding

    Yushi Bai, Xin Lv, Jiajie Zhang, Hongchang Lyu, Jiankai Tang, Zhidian Huang, Zhengxiao Du, Xiao Liu, Aohan Zeng, Lei Hou, et al. Longbench: A bilingual, multitask benchmark for long context understanding. InProceedings of the 62nd annual meeting of the association for computational linguistics (volume 1: Long papers), pages 3119–3137, 2024

  3. [3]

    Remember Me, Refine Me: A Dynamic Procedural Memory Framework for Experience-Driven Agent Evolution

    Zouying Cao, Jiaji Deng, Li Yu, Weikang Zhou, Zhaoyang Liu, Bolin Ding, and Hai Zhao. Remember me, refine me: A dynamic procedural memory framework for experience-driven agent evolution.arXiv preprint arXiv:2512.10696, 2025. ACL 2026 Findings

  4. [4]

    Halumem: Evaluating hallucinations in memory systems of agents

    Ding Chen, Simin Niu, Kehang Li, Peng Liu, Xiangping Zheng, Bo Tang, Xinchi Li, Feiyu Xiong, and Zhiyu Li. Halumem: Evaluating hallucinations in memory systems of agents.arXiv preprint arXiv:2511.03506, 2025

  5. [5]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production- ready ai agents with scalable long-term memory.arXiv preprint arXiv:2504.19413, 2025

  6. [6]

    A coefficient of agreement for nominal scales.Educational and psychological measurement, 20(1): 37–46, 1960

    Jacob Cohen. A coefficient of agreement for nominal scales.Educational and psychological measurement, 20(1): 37–46, 1960

  7. [7]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization.arXiv preprint arXiv:2404.16130, 2024

  8. [8]

    Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, et al. Lightmem: Lightweight and efficient memory-augmented generation.arXiv preprint arXiv:2510.18866, 2025

  9. [9]

    Novel memory forgetting techniques for autonomous ai agents: Balancing relevance and efficiency.arXiv preprint arXiv:2604.02280, 2026

    Payal Fofadiya and Sunil Tiwari. Novel memory forgetting techniques for autonomous ai agents: Balancing relevance and efficiency.arXiv preprint arXiv:2604.02280, 2026

  10. [10]

    Hipporag: Neurobiologically inspired long-term memory for large language models.Advances in neural information processing systems, 37:59532– 59569, 2024

    Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models.Advances in neural information processing systems, 37:59532– 59569, 2024

  11. [11]

    From RAG to memory: Non- parametric continual learning for large language models.CoRR, abs/2502.14802, 2025

    Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From rag to memory: Non-parametric continual learning for large language models, 2025. URLhttps://arxiv.org/abs/2502.14802

  12. [12]

    Memoryarena: Benchmarking agent memory in interde- pendent multi-session agentic tasks.arXiv preprint arXiv:2602.16313, 2026

    Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. arXiv preprint arXiv:2602.16313, 2026

  13. [13]

    RULER: What's the Real Context Size of Your Long-Context Language Models?

    Cheng-Ping Hsieh, Simeng Sun, Samuel Kriman, Shantanu Acharya, Dima Rekesh, Fei Jia, Yang Zhang, and Boris Ginsburg. Ruler: What’s the real context size of your long-context language models?arXiv preprint arXiv:2404.06654, 2024

  14. [14]

    Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

    Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions.arXiv preprint arXiv:2507.05257, 2025

  15. [15]

    Personamem-v2: Towards person- alized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

    Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory.arXiv preprint arXiv:2512.06688, 2025

  16. [16]

    Realtalk: A 21-day real-world dataset for long-term conversation.arXiv preprint arXiv:2502.13270, 2025

    Dong-Ho Lee, Adyasha Maharana, Jay Pujara, Xiang Ren, and Francesco Barbieri. Realtalk: A 21-day real-world dataset for long-term conversation.arXiv preprint arXiv:2502.13270, 2025

  17. [17]

    Retrieval-augmented generation for knowledge- intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge- intensive NLP tasks.Advances in Neural Information Processing Systems, 33:9459–9474, 2020

  18. [18]

    Hello again! LLM-powered personalized agent for long-term dialogue

    Yunfan Li et al. Hello again! LLM-powered personalized agent for long-term dialogue. In Proceedings of the 2025 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, 2025

  19. [19]

    Memos: An operating system for memory-augmented generation (mag) in large language models.arXiv preprint arXiv:2505.22101, 2025

    Zhiyu Li, Shichao Song, Hanyu Wang, Simin Niu, Ding Chen, Jiawei Yang, Chenyang Xi, Huayi Lai, Jihao Zhao, Yezhaohui Wang, et al. Memos: An operating system for memory-augmented generation (mag) in large language models.arXiv preprint arXiv:2505.22101, 2025

  20. [20]

    Simplemem: Efficient lifelong memory for llm agents

    Jiaqi Liu, Yaofeng Su, Peng Xia, Siwei Han, Zeyu Zheng, Cihang Xie, Mingyu Ding, and Huaxiu Yao. Simplemem: Efficient lifelong memory for llm agents.arXiv preprint arXiv:2601.02553, 2026

  21. [21]

    Evalu- ating very long-term conversational memory of llm agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evalu- ating very long-term conversational memory of llm agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024

  22. [22]

    Memgpt: towards llms as operating systems, 2023

    Charles Packer, Vivian Fang, Shishir_G Patil, Kevin Lin, Sarah Wooders, and Joseph_E Gonzalez. Memgpt: towards llms as operating systems, 2023

  23. [23]

    Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation

    Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. Memorag: Boosting long context processing with global memory-enhanced retrieval augmentation. InProceedings of the ACM on Web Conference 2025, pages 2366–2377, 2024

  24. [24]

    Memobrain: Executive memory as an agentic brain for reasoning.arXiv preprint arXiv:2601.08079, 2025

    Hongjin Qian, Zhao Cao, and Zheng Liu. Memobrain: Executive memory as an agentic brain for reasoning.arXiv preprint arXiv:2601.08079, 2025

  25. [25]

    Zep: A Temporal Knowledge Graph Architecture for Agent Memory

    Preston Rasmussen, Pavlo Paliychuk, Travis Beauvais, Jack Ryan, and Daniel Chalef. Zep: A temporal knowledge graph architecture for agent memory.arXiv preprint arXiv:2501.13956, 2025

  26. [26]

    Healthcare agent: Elic- iting the power of large language models for medical consultation.npj Artificial Intelligence, 1(24), 2025

    Zhiyao Ren, Yibing Zhan, Baosheng Yu, Liang Ding, Pingbo Xu, and Dacheng Tao. Healthcare agent: Elic- iting the power of large language models for medical consultation.npj Artificial Intelligence, 1(24), 2025. doi:10.1038/s44387-025-00021-x

  27. [27]

    The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

    Stephen Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond.Foundations and Trends in Information Retrieval, 3(4):333–389, 2009

  28. [28]

    ALFWorld: Aligning Text and Embodied Environments for Interactive Learning

    Mohit Shridhar, Xingdi Yuan, Marc-Alexandre Côté, Yonatan Bisk, Adam Trischler, and Matthew Hausknecht. Alfworld: Aligning text and embodied environments for interactive learning.arXiv preprint arXiv:2010.03768, 2020

  29. [29]

    Remem: Reasoning with episodic memory in language agent.arXiv preprint arXiv:2602.13530, 2026

    Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, and Yu Su. Remem: Reasoning with episodic memory in language agent.arXiv preprint arXiv:2602.13530, 2026

  30. [30]

    Membench: Towards more comprehensive evaluation on the memory of llm-based agents

    Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. InFindings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025

  31. [31]

    From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    Md Nayem Uddin, Kumar Shubham, Eduardo Blanco, Chitta Baral, and Gengyu Wang. From recall to forgetting: Benchmarking long-term memory for personalized agents.arXiv preprint arXiv:2604.20006, 2026

  32. [32]

    MIRIX: Multi-Agent Memory System for LLM-Based Agents

    Yu Wang and Xi Chen. Mirix: Multi-agent memory system for llm-based agents.arXiv preprint arXiv:2507.07957, 2025

  33. [33]

    Mem-α: Learning Memory Construction via Reinforcement Learning

    Yu Wang, Ryuichi Takanobu, Zhiqi Liang, Yuzhen Mao, Yuanzhe Hu, Julian McAuley, and Xiaojian Wu. Mem-α: Learning memory construction via reinforcement learning.arXiv preprint arXiv:2509.25911, 2025

  34. [34]

    LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory.arXiv preprint arXiv:2410.10813, 2024

  35. [35]

    KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions

    Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, et al. Knowme-bench: Benchmarking person understanding for lifelong digital companions. arXiv preprint arXiv:2601.04745, 2026

  36. [36]

    An agent-based adaptive medical dialogue service for personalized healthcare.Information Processing & Management, 62(3), 2025

    Fangfang Xu et al. An agent-based adaptive medical dialogue service for personalized healthcare.Information Processing & Management, 62(3), 2025

  37. [37]

    A-MEM: Agentic Memory for LLM Agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025

  38. [38]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z Pan, et al. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning.arXiv preprint arXiv:2508.19828, 2025

  39. [39]

    Memagent: Reshaping long-context llm with multi-conv rl-based memory agent

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al. Memagent: Reshaping long-context llm with multi-conv rl-based memory agent.arXiv preprint arXiv:2507.02259, 2025

  40. [40]

    Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

    Yanwei Yue, Boci Peng, Xuanbo Fan, Jiaxin Guo, Qiankun Li, and Yan Zhang. Mem-t: Densifying rewards for long-horizon memory agents.arXiv preprint arXiv:2601.23014, 2026

  41. [41]

    Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory.arXiv preprint arXiv:2601.03192, 2026

    Shengtao Zhang, Jiaqian Wang, Ruiwen Zhou, Junwei Liao, Yuchen Feng, Zhuo Li, Yujie Zheng, Weinan Zhang, Ying Wen, Zhiyu Li, et al. Memrl: Self-evolving agents via runtime reinforcement learning on episodic memory. arXiv preprint arXiv:2601.03192, 2026

  42. [42]

    Ama-bench: Evaluating long-horizon memory for agentic applications, 2026

    Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, et al. Ama-bench: Evaluating long-horizon memory for agentic applications.arXiv preprint arXiv:2602.22769, 2026

  43. [43]

    Memorybank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. InProceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024

  44. [44]

    WebArena: A Realistic Web Environment for Building Autonomous Agents

    Shuyan Zhou, Frank F Xu, Hao Zhu, Xuhui Zhou, Robert Lo, Abishek Sridhar, Xianyi Cheng, Tianyue Ou, Yonatan Bisk, Daniel Fried, et al. Webarena: A realistic web environment for building autonomous agents.arXiv preprint arXiv:2307.13854, 2023

  45. [45]

    MEM1: Learning to Synergize Memory and Reasoning for Efficient Long-Horizon Agents

    Zijian Zhou, Ao Qu, Zhaoxuan Wu, Sunghwan Kim, Alok Prakash, Daniela Rus, Jinhua Zhao, Bryan Kian Hsiang Low, and Paul Pu Liang. Mem1: Learning to synergize memory and reasoning for efficient long-horizon agents. arXiv preprint arXiv:2506.15841, 2025

  46. [46]

    Provide the target entity name directly

  47. [47]

    Keep the answer brief and precise

  48. [48]

    Answer: TheEEMtemplate is designed for precise slot-level retrieval

    Do not include lengthy explanations. Answer: TheEEMtemplate is designed for precise slot-level retrieval. It therefore emphasizes direct extraction of the target entity rather than extended explanation. TLA Answer Prompt Context: Based on <memory_source>, accurately answer the following question. Question: <question> Answer Requirements:

  49. [49]

    If the question asks about a time, answer in YYYY-MM-DD format (e.g., 2024-01-15)

  50. [50]

    19 A preprint

    If the question asks about an event at a specific time, clearly describe the event content and key details. 19 A preprint

  51. [51]

    Keep the answer concise and directly grounded in memory. Answer: TheTLAtemplate explicitly constrains temporal questions to normalized date outputs whenever possible, while still allowing concise event descriptions when the query asks what happened at a particular time point. SUA Answer Prompt Context: Based on <memory_source>, accurately answer the follo...

  52. [52]

    Describe the patient’s most recent status

  53. [53]

    Reflect important changes over time when necessary

  54. [55]

    Be concise and direct. Answer: TheSUAprompt emphasizes up-to-date patient status and trajectory-aware summarization, which is important for questions that ask for the latest condition rather than isolated historical facts. MQ Answer Prompt Context: Based on <memory_source>, and considering the patient’s allergy history, medical history, medications, and p...

  55. [56]

    Select all correct options

  56. [57]

    Output only the option letter(s), such as B or B,D

  57. [58]

    Do not provide any explanation. Answer: TheMQtemplate enforces a strict multiple-choice output format, which simplifies automatic evaluation and avoids verbose justifications that are irrelevant to the benchmark target. IG Answer Prompt Context: Based on <memory_source>, and considering the patient’s allergy history, medical history, medications, and pers...

  58. [59]

    Reason from this patient’s specific remembered information; do not give generic medical advice

  59. [60]

    Maintain a warm yet professional tone

  60. [61]

    Be concise, direct, and avoid boilerplate

  61. [62]

    Answer: TheIGtemplate is designed for personalized medical inference

    If recommending or advising against something, briefly explain the reason based on the patient’s specific situation. Answer: TheIGtemplate is designed for personalized medical inference. It explicitly discourages generic recommendations and instead requires patient-grounded reasoning based on remembered allergy history, medication use, prior diagnoses, an...

  62. [63]

    Clearly list the memory content you draw upon

  63. [64]

    Present a clear reasoning path from evidence to conclusions

  64. [65]

    Provide a final comprehensive judgment. Answer: Finally, the MCD template targets queries that require multi-visit synthesis and explicit reasoning over dispersed historical evidence. Compared with the other templates, it places the strongest emphasis on transparent reasoning paths and comprehensive judgment grounded in multiple memory items.

    D.2 LLM-as-Jud...

  65. [66]

    Asking when a certain event occurred. The model must correctly provide the time point

  66. [67]

    Asking what happened at a certain time. The model must correctly describe the event content. Judge strictly:
    - If the model’s answer contains the correct time point or the correct event content, judge as [CORRECT]
    - If the model’s answer about the time/event does not match the reference answer or fails to answer, judge as [INCORRECT]
    - Date formats do not ...

  67. [68]

    Must be based on memory: The model’s answer must demonstrate the use of the patient’s past memory information, not guessing or generic medical knowledge

  68. [69]

    No guessing allowed: If the model has not retrieved relevant memory information but gives a “coincidentally correct” answer, it should be judged as [INCORRECT]

  69. [70]

    Information source requirement: A correct answer should convey that the model “remembers” this patient’s specific situation, rather than guessing. Judge strictly:
    - If the model’s answer demonstrates the use of patient historical memory and the core content is consistent with the reference answer, judge as [CORRECT]
    - If the model’s answer contains key in...

  70. [71]

    Patient Information Utilization (Key)
    - The model must demonstrate the use of patient-specific information from memory
    - If required_patient_info is provided in metadata, the model’s answer must reflect understanding of these key pieces of information (important)
    - If the patient’s specific circumstances and past memories are ignored or missing, judge as ...
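A minimal sketch of the `required_patient_info` check, assuming that metadata field is a list of strings. Case-insensitive substring matching is a deterministic stand-in for the LLM judge; it illustrates the rubric's intent, not the actual implementation.

```python
def covers_required_info(answer: str, required_patient_info: list[str]) -> bool:
    """Check whether a model answer reflects every required piece of
    patient information listed in the question metadata.

    Substring matching only sketches the rubric; the benchmark's judge
    is an LLM and can credit paraphrases that this check would miss.
    """
    answer_lower = answer.lower()
    return all(info.lower() in answer_lower for info in required_patient_info)
```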

  71. [72]

    Reasoning Quality
    - The model must reason based on retrieved patient historical information, not purely from its own medical common sense
    - If only a conclusion is given without sufficient reference to patient information and memory, judge as [INCORRECT]
    - If the model gives a “common wrong answer” type of response (generic advice), judge as [INCORRECT]

  72. [73]

    Conclusion Correctness
    - The final recommendation/conclusion should be fully consistent with the reference answer in direction
    - Even if the conclusion is correct, if it lacks reasoning based on patient information, still judge as [INCORRECT]
    Judgment rules:
    - [CORRECT]: Answer uses patient-specific information, contains required patient information point...

  73. [74]

    Patient-Specific Information Principle: The model must explicitly reference the patient’s specific data (such as specific test values, medication dosages, specific timing of symptom onset, particular diagnostic results), rather than giving generic medical common sense.
    - “Poor blood sugar control may lead to...”. This is generic medical knowledge, not pat...

  74. [75]

    Memory Retrieval Evidence Principle: If the model fails to demonstrate specific references to the patient’s historical records, even if the reasoning direction is correct, it should be judged as inadequate. The model must show it “remembers” the patient’s specific situation

  75. [76]

    Strict Causal Chain Correspondence Principle: The causal relationships established by the model must precisely correspond to the causal mechanisms described in the reasoning chain nodes. Similar but different mechanisms cannot substitute for one another.

  76. [77]

    Node Content Precise Matching Principle: During node verification, it is not sufficient to judge as “covered” merely because the model mentioned a related concept. You must verify whether the model referenced the core specific content within the node.
    Evaluation Steps
    Step 1: Strict Node-by-Node Check
    For each node in the reasoning chain, all of the follo...