arxiv: 2603.23231 · v2 · pith:DYYTTHVJnew · submitted 2026-03-24 · 💻 cs.AI

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments

Pith reviewed 2026-05-21 09:50 UTC · model grok-4.3

classification 💻 cs.AI
keywords personalized memory · long-term memory · persona consistency · preference evolution · AI agents · benchmark evaluation · multi-domain interactions · event-driven memory

The pith

Advanced memory systems improve preference extraction by connecting related interactions but fail to keep consistent personas over time and across topics.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces PERMA, a benchmark that tests how well AI memory systems maintain a consistent user persona as preferences develop gradually over many conversations. It argues that real-world personalization arises when preferences accumulate across linked events in noisy, multi-topic settings, not when a single fact is retrieved from a long chat. To make the test harder and more lifelike, the benchmark adds text variability and individual speech styles. Results show that memory systems which link related events extract preferences more accurately, and with fewer tokens, than semantic retrieval over raw dialogue history, yet they still fail to hold a coherent persona as the timeline lengthens and domains mix.

Core claim

PERMA is a benchmark that evaluates long-term memory in personalized agents by using sequences of temporally ordered interaction events across multiple sessions and domains. Preference-related queries are inserted over time, with text variability and linguistic alignment added to simulate real erratic user inputs and individual ways of speaking. Experiments reveal that memory systems linking related interactions can extract precise preferences more effectively and with lower token consumption than traditional semantic retrieval from raw dialogues, yet these systems still have difficulty maintaining coherent personas through temporal depth and cross-domain interference.
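The token-cost contrast in this claim can be illustrated with a toy example. The sketch below is not the paper's retrieval pipeline: it assumes a bag-of-words cosine similarity as a stand-in for an embedding model, a word count as a stand-in for a real tokenizer, and invented dialogue turns. The names `retrieve_top_k`, `context_tokens`, and `memory_note` are hypothetical.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase word tokenizer; a crude stand-in for a real tokenizer."""
    return re.findall(r"[a-z0-9']+", text.lower())


def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(query, turns, k):
    """Return the k raw dialogue turns most similar to the query."""
    q = Counter(tokenize(query))
    ranked = sorted(turns, key=lambda t: cosine(q, Counter(tokenize(t))), reverse=True)
    return ranked[:k]


def context_tokens(snippets):
    """Approximate context cost as the total word count of the snippets."""
    return sum(len(tokenize(s)) for s in snippets)


# Invented example history (assumption, not benchmark data).
turns = [
    "I booked a window seat again because I like to sleep on flights.",
    "Can you remind me to call the bank tomorrow?",
    "Please pick a quiet hotel, I sleep badly when it is noisy.",
    "What is the weather in Lisbon next week?",
]
query = "Book a flight and hotel where I can sleep well"

# Baseline: retrieve raw turns. Alternative: one consolidated note that a
# linking-based memory might store after connecting the two related turns.
top = retrieve_top_k(query, turns, k=2)
memory_note = "Prefers quiet places to sleep; window seat on flights."

print(context_tokens(top), context_tokens([memory_note]))
```

In this toy case the consolidated note carries both preferences in fewer words than the two retrieved raw turns, which is the kind of saving the claim describes; whether a real memory system produces such a note is exactly what the benchmark measures.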

What carries the argument

The PERMA benchmark's event-driven setup, which organizes interactions into temporally ordered events spanning sessions and domains with inserted preference queries to test gradual preference accumulation and persona consistency.
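One way to picture this setup is as a timeline of events with probes attached to positions on it. The sketch below is only an illustration under assumptions: the class and field names (`Event`, `Probe`, `Timeline`, `after_step`) are hypothetical and are not taken from the released PERMA code, and the example events are invented.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Event:
    step: int      # position on the interaction timeline
    session: int   # session the event belongs to
    domain: str    # topic area of the event
    text: str      # dialogue content


@dataclass(frozen=True)
class Probe:
    after_step: int  # the query may only rely on events with step <= after_step
    question: str


@dataclass
class Timeline:
    events: list = field(default_factory=list)
    probes: list = field(default_factory=list)

    def visible_history(self, probe):
        """Events the agent has seen when the probe is asked, in temporal order."""
        ordered = sorted(self.events, key=lambda e: e.step)
        return [e for e in ordered if e.step <= probe.after_step]

    def domains_seen(self, probe):
        """Distinct domains in the visible history (more than one means multi-domain)."""
        return sorted({e.domain for e in self.visible_history(probe)})


timeline = Timeline(
    events=[
        Event(2, 1, "Shopping", "Returned the leather bag; I avoid animal products now."),
        Event(1, 1, "Shopping", "Ordered a leather bag."),
        Event(3, 2, "Finance", "Set a monthly budget for ethical brands."),
    ],
    probes=[
        Probe(after_step=1, question="Which bag should I recommend?"),
        Probe(after_step=3, question="Which bag should I recommend?"),
    ],
)

early, late = timeline.probes
print(len(timeline.visible_history(early)), timeline.domains_seen(late))
```

Asking the same question at two positions is what lets a benchmark of this shape separate static recall from tracking a preference that changed between the probes.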

Load-bearing premise

The constructed sequence of temporally ordered events with inserted queries and added variability truly captures how user preferences evolve gradually in real, noisy, multi-domain conversations.

What would settle it

A study comparing agent performance on PERMA tasks versus actual long-term user interactions with tracked preference changes would test if the benchmark's findings hold in practice; if real users show no advantage for event-linking, the claim weakens.

Figures

Figures reproduced from arXiv: 2603.23231 by Bo Tang, Chao Zhang, Derong Xu, Enhong Chen, Feiyu Xiong, Haotian Zhang, Jia Li, Junda Lin, Junyi Zhu, Long Shu, Shuochen Liu, Tong Xu, Yuhao Chen, Zhiyu Li.

Figure 1: Comparison of context construction and evaluation. (
Figure 2: The PERMA pipeline for dialogue construction and evaluation.
Figure 3: Performance comparison of memory systems across evaluation checkpoints in the
Figure 4: Comprehensive comparison of model and memory system performance across Clean and Noise single-domain
Figure 5: MCQ Acc. of standalone LLMs at different evaluation checkpoints. Results are categorized by single-domain (Left) and multi-domain (Right) settings. reported in PrefEval [80]. An example is MemOS, whose retrieval volume nearly doubles from 709.1 tokens (Clean) to 1486.7 tokens (Noise). This expanded context provides a more detailed description of user preferences, leading to an increase in Memory Score (2.2…
Figure 6: (Left) Accuracy (MCQ Acc.) across three evaluation checkpoints in the Clean setting (Multi), where the dashed line represents the baseline performance under information omission (Type 1). (Right) Memory Scores across the event types. (2) Cross-Domain Interference and Evaluation Limitations. As shown in
Figure 7: Analysis of retrieval depth (Top-k) on multi-domain performance. (
Figure 8: Comprehensive comparison of model performance across Clean and Noise multi-domain scenarios: (
Figure 9: Variation and performance gap of baselines in
Figure 10: MCQ Acc. performance trends of approaches across different segment positions in single-domain tasks under Clean and Noisy settings. simulates large-scale interaction histories and stress-tests models’ ability to maintain effective memory under extreme context lengths. Following the pipeline described in Section 4.2.2, we align the user queries of the Clean setting with real-world conversational styles and…
Figure 11: (Left) Heatmap of MCQ Acc. across diverse users, highlighting how persona uniqueness influences system success regardless of context length. (Right) MCQ Acc. trends across positions in style-aligned long context settings. Qwen2.5-14B-1M, despite being fine-tuned on ultra-long contexts, exhibits a performance decline (from 0.766 to 0.716). Memory systems, however, maintain stable performance, highlighting …
Figure 12: Overall MCQ Acc. across all experimental configurations (Clean, In-session Noise, and Style-aligned Long-context): single-domain (Left) and multi-domain (Right) tasks. domains like Finance (0.914) and Messaging (0.980), where task requirements are relatively stable. Within the memory systems, MemOS narrows the gap between plug-in memory-based agents and vanilla models, particularly in the Shopping (0.889)…
Figure 13: TIMELINE_GENERATION ANSWER_OPTION_PROMPT You are an assistant specialized in answering multiple-choice questions. ## Your Memory {context} ## User Task Query {question} ## Options: {options} Your goal is to choose **the most appropriate answer option for the User Task Query** from the Options based on your memory. The output should be **ONLY the option key** without any additional explanation, e.g., ‘A‘, …
Figure 14: ANSWER_OPTION_PROMPT Manuscript submitted to ACM
Figure 15: DIALOGUE_GENERATION
Figure 16: USER_FEEDBACK_PROMPT ANSWER_INTERACTION_PROMPT You are a conversational AI assistant focused on creating natural, thorough, and personalized interactions to complete the user query. Below is the memory accumulated from your past interactions with this user ## Your Memory {context} ## User Task Query {question} ## Current Task Conversation History: {history} You need to provide a reply based on the user’s …
Figure 17: ANSWER_INTERACTION_PROMPT
Figure 18: OPTION_GENERATION_PROMPT
Figure 19: EVAL_MEMORY_SCORE
Figure 20: Detailed annotation in Label Studio for data quality assessment. The protocol evaluates 6 criteria, ranging from
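The ANSWER_OPTION_PROMPT text quoted with Figure 13 asks the model to output only an option key, and several captions report MCQ Acc. per evaluation checkpoint. A minimal sketch of how such replies could be scored follows. It assumes four option keys A to D and a simple record format; the paper's actual scoring script may differ, and `parse_option_key` and `accuracy_by_checkpoint` are hypothetical names.

```python
import re
from collections import defaultdict
from typing import Optional


def parse_option_key(raw: str, valid_keys: str = "ABCD") -> Optional[str]:
    """Extract a single option key from a model reply.

    Returns None when no key or more than one distinct key appears, so
    ambiguous replies count as wrong. Matching is case-sensitive, so a
    capital article "A" at the start of a sentence would be read as a key.
    """
    found = set(re.findall(rf"\b([{valid_keys}])\b", raw.strip()))
    return found.pop() if len(found) == 1 else None


def accuracy_by_checkpoint(records):
    """records: iterable of (checkpoint, model_reply, gold_key) tuples."""
    hits, totals = defaultdict(int), defaultdict(int)
    for checkpoint, reply, gold in records:
        totals[checkpoint] += 1
        hits[checkpoint] += int(parse_option_key(reply) == gold)
    return {c: hits[c] / totals[c] for c in totals}


# Invented replies (assumption, not benchmark outputs).
records = [
    (1, "A", "A"),
    (1, "B", "A"),
    (2, "'C'", "C"),
    (2, "A or B", "A"),  # ambiguous reply, scored as incorrect
]
scores = accuracy_by_checkpoint(records)
print(scores)
```

Treating ambiguous replies as errors is a design choice of this sketch; it keeps a model from gaining credit by listing several options.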
Abstract

Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. Existing evaluations of this capability typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events driving user preference evolution. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems extract precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper introduces PERMA, a benchmark for evaluating long-term personalized memory in LLMs. It consists of temporally ordered multi-session, multi-domain interaction events with inserted preference-related queries, text variability, and linguistic alignment to better simulate gradual preference evolution in noisy real-world contexts. The authors design multiple-choice and interactive tasks to test persona understanding over time and compare memory systems that link related interactions against traditional semantic retrieval of raw dialogues. Experiments claim that linking-based systems extract more precise preferences, reduce token consumption, and outperform semantic retrieval, yet all systems still struggle to maintain coherent personas across temporal depth and cross-domain interference.

Significance. If the synthetic event sequences faithfully proxy gradual, implicit preference accumulation, the work usefully identifies limitations in current memory architectures for personalization and supplies an open benchmark (with code and data released at the cited GitHub repository) for future progress. The emphasis on event linking and token efficiency is a concrete, testable contribution.

major comments (1)
  1. [§3] §3 (PERMA Construction): The insertion of explicit preference-related queries into the temporally ordered events, combined with added text variability, risks generating artificially clean and detectable signals rather than the erratic, implicit, and gradual preference drift characteristic of real user data. This construction choice is load-bearing for the central claims about both the reported gains of linking-based memory and the reported failures in persona coherence; without additional validation (e.g., comparison to real user logs or human judgment of implicitness), the benchmark may not support the generalization that current systems 'still struggle' in realistic settings.
minor comments (2)
  1. [Abstract] Abstract and §4: No quantitative results, error bars, or statistical significance tests are reported for the preference-extraction or persona-coherence metrics, making it difficult to judge the practical magnitude of the claimed improvements over semantic retrieval.
  2. [§4.2] §4.2: The exact definitions and scoring rubrics for the interactive tasks (e.g., how persona coherence is measured across sessions) should be stated more formally, perhaps with an equation or pseudocode, to allow exact reproduction.
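The first minor comment asks for error bars and significance tests. One common way to supply them for paired accuracy comparisons is a percentile bootstrap over test items, sketched below. This is an illustration of that general technique, not something the paper reports; the function name, the resample count, and the 0/1 correctness inputs are assumptions.

```python
import random


def paired_bootstrap_ci(correct_a, correct_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for accuracy(A) - accuracy(B) on the same items.

    correct_a and correct_b are equally long sequences of 0/1 flags, one per
    test item, marking whether system A and system B answered it correctly.
    """
    if len(correct_a) != len(correct_b) or not correct_a:
        raise ValueError("need two non-empty, equally long 0/1 sequences")
    rng = random.Random(seed)
    n = len(correct_a)
    diffs = []
    for _ in range(n_resamples):
        # Resample items with replacement, keeping each A/B pair together.
        idx = [rng.randrange(n) for _ in range(n)]
        diffs.append(sum(correct_a[i] - correct_b[i] for i in idx) / n)
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Invented correctness flags for two systems on the same ten items.
system_a = [1, 1, 1, 0, 1, 1, 0, 1, 1, 1]
system_b = [1, 0, 1, 0, 0, 1, 0, 1, 0, 1]
low, high = paired_bootstrap_ci(system_a, system_b)
print(low, high)
```

An interval that excludes zero would suggest the accuracy gap is unlikely to be a sampling artifact; with only ten items, as here, the interval is wide and mainly shows the mechanics.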

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for their constructive and detailed feedback on our manuscript. We address the major comment below and indicate the revisions made in response.

  1. Referee: [§3] §3 (PERMA Construction): The insertion of explicit preference-related queries into the temporally ordered events, combined with added text variability, risks generating artificially clean and detectable signals rather than the erratic, implicit, and gradual preference drift characteristic of real user data. This construction choice is load-bearing for the central claims about both the reported gains of linking-based memory and the reported failures in persona coherence; without additional validation (e.g., comparison to real user logs or human judgment of implicitness), the benchmark may not support the generalization that current systems 'still struggle' in realistic settings.

    Authors: We appreciate the referee's point regarding the balance between controlled construction and real-world fidelity in PERMA. The explicit insertion of preference-related queries at varying temporal positions is intentional to capture gradual preference accumulation across sessions and domains, a feature absent from prior needle-in-a-haystack evaluations. Text variability and linguistic alignment were added precisely to model erratic inputs and idiolects, as stated in Section 3. We acknowledge that synthetic benchmarks cannot fully replicate the implicitness of private user logs. In the revised manuscript we have expanded Section 3 with additional justification of these design decisions, clarified their relation to observed real-world personalization challenges, and included a human evaluation study (new Appendix) in which annotators rate the implicitness and realism of the generated events. These changes provide further support for the benchmark while preserving its utility for isolating memory mechanisms. revision: partial

Circularity Check

0 steps flagged

No significant circularity in benchmark construction or empirical claims

The paper introduces PERMA as an explicitly constructed benchmark with temporally ordered events, inserted preference queries, text variability, and linguistic alignment to address limitations in prior evaluations. Experimental results report direct comparisons between linking-based memory systems and semantic retrieval on these tasks, without any reduction of outcomes to fitted parameters, self-defined quantities, or load-bearing self-citations. The design choices are presented as independent methodological decisions to simulate gradual preference evolution, and the reported gains/failures are empirical observations on the defined tasks rather than derivations equivalent to the inputs by construction. This is self-contained against external benchmarks and matches the default expectation of no circularity.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The work introduces an evaluation framework rather than a mathematical derivation, resting primarily on domain assumptions about how user preferences form in practice.

axioms (1)
  • Domain assumption: Preferences emerge gradually and accumulate across interactions within noisy contexts.
    Explicitly stated in the abstract as the fundamental characteristic overlooked by existing evaluations.
invented entities (1)
  • PERMA benchmark (no independent evidence)
    Purpose: to evaluate persona consistency over time in memory agents using event sequences and variability.
    Newly constructed dataset and task suite introduced in this work.

pith-pipeline@v0.9.0 · 5802 in / 1169 out tokens · 55967 ms · 2026-05-21T09:50:07.153884+00:00 · methodology

Forward citations

Cited by 3 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. $\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

    cs.AI 2026-05 unverdicted novelty 7.0

    π-Bench is a new evaluation suite that jointly measures proactivity and task completion in AI agents across sustained multi-turn workflows containing hidden intents and cross-session continuity.

  2. $\pi$-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows

    cs.AI 2026-05 unverdicted novelty 7.0

    π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.

  3. Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data

    cs.AI 2026-04 unverdicted novelty 7.0

    MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.

Reference graph

Works this paper leans on

112 extracted references · 112 canonical work pages · cited by 2 Pith papers · 30 internal anchors

  1. [1]

    Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. 2025. MemoryBench: A Benchmark for Memory and Continual Learning in LLM Systems. arXiv:2510.17281 [cs.LG] https://arxiv.org/abs/2510.17281

  2. [2]

    Haonan Bian, Zhiyuan Yao, Sen Hu, Zishan Xu, Shaolei Zhang, Yifu Guo, Ziliang Yang, Xueran Han, Huacan Wang, and Ronghao Chen. 2026. RealMem: Benchmarking LLMs in Real-World Memory-Driven Interaction. arXiv:2601.06966 [cs.CL] https://arxiv.org/abs/2601.06966

  3. [3]

    Tom Brown, Benjamin Mann, Nick Ryder, Melanie Subbiah, Jared D Kaplan, Prafulla Dhariwal, Arvind Neelakantan, Pranav Shyam, Girish Sastry, Amanda Askell, et al. 2020. Language models are few-shot learners. NeurIPS 33 (2020), 1877–1901

  4. [4]

    Jianlv Chen, Shitao Xiao, Peitian Zhang, Kun Luo, Defu Lian, and Zheng Liu. 2023. BGE M3-Embedding: Multi-Lingual, Multi-Functionality, Multi-Granularity Text Embeddings Through Self-Knowledge Distillation. arXiv:2309.07597 [cs.CL]

  5. [5]

    Yuhao Chen, Yuanjie Lyu, Shuochen Liu, Chao Zhang, Junhui Lv, and Tong Xu. 2025. Think Wider, Detect Sharper: Reinforced Reference Coverage for Document-Level Self-Contradiction Detection. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, Christos Christodoulopoulos, Tanmoy Chakraborty, Carolyn Rose, and Violet Peng...

  6. [6]

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. 2025. Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory. arXiv:2504.19413 [cs.CL] https://arxiv.org/abs/2504.19413

  7. [7]

    Gheorghe Comanici, Eric Bieber, Mike Schaekermann, et al. 2025. Gemini 2.5: Pushing the Frontier with Advanced Reasoning, Multimodality, Long Context, and Next Generation Agentic Capabilities. arXiv:2507.06261 [cs.CL] https://arxiv.org/abs/2507.06261

  8. [8]

    Pengfei Du. 2026. Memory for Autonomous LLM Agents: Mechanisms, Evaluation, and Emerging Frontiers. arXiv:2603.07670 [cs.AI] https://arxiv.org/abs/2603.07670

  9. [9]

    Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. 2024. PerLTQA: A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Synthesis in Question Answering. arXiv:2402.16288 [cs.CL] https://arxiv.org/abs/2402.16288

  10. [10]

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. 2025. LightMem: Lightweight and Efficient Memory-Augmented Generation. arXiv:2510.18866 [cs.CL] https://arxiv.org/abs/2510.18866

  11. [11]

    Xueyang Feng, Weinan Gan, Xu Chen, Quanyu Dai, and Yong Liu. 2026. How Does Personalized Memory Shape LLM Behavior? Benchmarking Rational Preference Utilization in Personalized Assistants. arXiv:2601.16621 [cs.CL] https://arxiv.org/abs/2601.16621

  12. [12]

    Yunfan Gao, Yun Xiong, Xinyu Gao, Kangxiang Jia, Jinliu Pan, Yuxi Bi, Yi Dai, Jiawei Sun, Haofen Wang, and Haofen Wang. 2023. Retrieval-augmented generation for large language models: A survey. arXiv preprint arXiv:2312.10997 (2023).

  13. [13]

    GLM-5-Team, Aohan Zeng, Xin Lv, Zhenyu Hou, et al. 2026. GLM-5: from Vibe Coding to Agentic Engineering. arXiv:2602.15763 [cs.LG] https://arxiv.org/abs/2602.15763

  14. [15]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, et al. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783 (2024)

  15. [17]

    Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. 2025. From rag to memory: Non-parametric continual learning for large language models. arXiv preprint arXiv:2502.14802 (2025)

  16. [18]

    Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, Jiaxin Pei, Julian McAuley, Yejin Choi, and Alex Pentland. 2026. MemoryArena: Benchmarking Agent Memory in Interdependent Multi-Session Agentic Tasks. arXiv:2602.16313 [cs.CL] https://arxiv.org/abs/2602.16313

  17. [19]

    Chuanrui Hu, Xingze Gao, Zuyi Zhou, Dannong Xu, Yi Bai, Xintong Li, Hui Zhang, Tong Li, Chong Zhang, Lidong Bing, and Yafeng Deng

  18. [20]

    EverMemOS: A Self-Organizing Memory Operating System for Structured Long-Horizon Reasoning. arXiv:2601.02163 [cs.AI] https://arxiv.org/abs/2601.02163

  19. [21]

    Yuyang Hu, Shichun Liu, Yanwei Yue, et al. 2025. Memory in the Age of AI Agents. arXiv:2512.13564 [cs.CL] https://arxiv.org/abs/2512.13564

  20. [22]

    Yulin Hu, Zimo Long, Jiahe Guo, Xingyu Sui, Xing Fu, Weixiang Zhao, Yanyan Zhao, and Bing Qin. 2026. OP-Bench: Benchmarking Over-Personalization for Memory-Augmented Personalized Conversational Agents. arXiv:2601.13722 [cs.CL] https://arxiv.org/abs/2601.13722

  21. [23]

    Aaron Jaech, Adam Kalai, Adam Lerer, Adam Richardson, Ahmed El-Kishky, Aiden Low, Alec Helyar, Aleksander Madry, Alex Beutel, Alex Carney, et al. 2024. Openai o1 system card. arXiv preprint arXiv:2412.16720 (2024)

  22. [24]

    Bowen Jiang, Zhuoqun Hao, Young-Min Cho, Bryan Li, Yuan Yuan, Sihao Chen, Lyle Ungar, Camillo J. Taylor, and Dan Roth. 2025. Know Me, Respond to Me: Benchmarking LLMs for Dynamic User Profiling and Personalized Responses at Scale. arXiv:2504.14225 [cs.CL] https://arxiv.org/abs/2504.14225

  23. [25]

    Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, Radha Poovendran, Gregory Wornell, Lyle Ungar, Dan Roth, Sihao Chen, and Camillo Jose Taylor. 2025. PersonaMem-v2: Towards Personalized Intelligence via Learning Implicit User Personas and Agentic Memory. arXiv:2512.06688 [cs.C...

  24. [26]

    Hannah Rose Kirk, Alexander Whitefield, Paul Röttger, Andrew Bean, Katerina Margatina, Juan Ciro, Rafael Mosquera, Max Bartolo, Adina Williams, He He, Bertie Vidgen, and Scott A. Hale. 2024. The PRISM Alignment Dataset: What Participatory, Representative and Individualised Human Feedback Reveals About the Subjective and Multicultural Alignment of Large La...

  25. [27]

    Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. 2025. Hello Again! LLM-powered Personalized Agent for Long-term Dialogue. arXiv:2406.05925 [cs.CL] https://arxiv.org/abs/2406.05925

  26. [28]

    Xiaopeng Li, Pengyue Jia, Derong Xu, Yi Wen, Yingyi Zhang, Wenlin Zhang, Wanyu Wang, Yichao Wang, Zhaocheng Du, Xiangyang Li, Yong Liu, Huifeng Guo, Ruiming Tang, and Xiangyu Zhao. 2025. A Survey of Personalization: From RAG to Agent. arXiv:2504.10147 [cs.IR] https://arxiv.org/abs/2504.10147

  27. [29]

    Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. 2024. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459 (2024)

  28. [30]

    Zhiyu Li, Chenyang Xi, Chunyu Li, Ding Chen, et al. 2025. MemOS: A Memory OS for AI System. arXiv:2507.03724 [cs.CL] https://arxiv.org/abs/2507.03724

  29. [31]

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. 2023. Lost in the Middle: How Language Models Use Long Contexts. arXiv:2307.03172 [cs.CL] https://arxiv.org/abs/2307.03172

  30. [32]

    Shuochen Liu, Pengfei Luo, Chao Zhang, Yuhao Chen, Haotian Zhang, Qi Liu, Xin Kou, Tong Xu, and Enhong Chen. 2025. Look as You Think: Unifying Reasoning and Visual Evidence Attribution for Verifiable Document RAG via Reinforcement Learning. arXiv:2511.12003 [cs.AI] https://arxiv.org/abs/2511.12003

  31. [33]

    Pengfei Luo, Jingbo Zhou, Tong Xu, Yuan Xia, Linli Xu, and Enhong Chen. 2025. ImageScope: Unifying Language-Guided Image Retrieval via Large Multimodal Model Collective Reasoning. arXiv:2503.10166 [cs.IR] https://arxiv.org/abs/2503.10166

  32. [34]

    Yuanjie Lyu, Zhiyu Li, Simin Niu, Feiyu Xiong, Bo Tang, Wenjin Wang, Hao Wu, Huanyong Liu, Tong Xu, and Enhong Chen. 2025. Crud-rag: A comprehensive chinese benchmark for retrieval-augmented generation of large language models. TOIS 43, 2 (2025), 1–32

  33. [35]

    Shengjie Ma, Chengjin Xu, Xuhui Jiang, Muzhi Li, Huaren Qu, Cehao Yang, Jiaxin Mao, and Jian Guo. 2024. Think-on-graph 2.0: Deep and faithful large language model reasoning with knowledge-guided retrieval augmented generation. arXiv preprint arXiv:2407.10805 (2024)

  34. [36]

    Xinbei Ma, Yeyun Gong, Pengcheng He, Hai Zhao, and Nan Duan. 2023. Query rewriting in retrieval-augmented large language models. In EMNLP. 5303–5315

  35. [37]

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. 2024. Evaluating Very Long-Term Conversational Memory of LLM Agents. arXiv:2402.17753 [cs.CL] https://arxiv.org/abs/2402.17753

  36. [38]

    Wenyu Mao, Haoyang Liu, Zhao Liu, Haosong Tan, Yaorui Shi, Jiancan Wu, An Zhang, and Xiang Wang. 2026. Collaborative Multi-Agent Optimization for Personalized Memory System. arXiv:2603.12631 [cs.MA] https://arxiv.org/abs/2603.12631

  37. [39]

    Abhiman Neelakanteswara, Shreyas Chaudhari, and Hamed Zamani. 2024. RAGs to Style: Personalizing LLMs with Style Embeddings. In Proceedings of the 1st Workshop on Personalization of Generative AI Systems (PERSONALIZE 2024), Ameet Deshpande, EunJeong Hwang, Vishvak Murahari, Joon Sung Park, Diyi Yang, Ashish Sabharwal, Karthik Narasimhan, and Ashwin Kalyan ...

  38. [40]

    OpenAI. 2024. GPT-4o System Card. arXiv:2410.21276 [cs.CL] https://arxiv.org/abs/2410.21276

  39. [41]

    OpenAI, Josh Achiam, Steven Adler, et al. 2024. GPT-4 Technical Report. arXiv:2303.08774 [cs.CL] https://arxiv.org/abs/2303.08774

  40. [42]

    Cheng Qian, Zuxin Liu, Akshara Prabhakar, Zhiwei Liu, Jianguo Zhang, Haolin Chen, Heng Ji, Weiran Yao, Shelby Heinecke, Silvio Savarese, Caiming Xiong, and Huan Wang. 2025. UserBench: An Interactive Gym Environment for User-Centric Agents. arXiv:2507.22034 [cs.AI] https://arxiv.org/abs/2507.22034

  41. [43]

    Hongjin Qian, Zheng Liu, Peitian Zhang, Kelong Mao, Defu Lian, Zhicheng Dou, and Tiejun Huang. 2025. MemoRAG: Boosting Long Context Processing with Global Memory-Enhanced Retrieval Augmentation. arXiv:2409.05591 [cs.CL]

  42. [44]

    Kan Ren, Jiarui Qin, Yuchen Fang, Weinan Zhang, Lei Zheng, Weijie Bian, Guorui Zhou, Jian Xu, Yong Yu, Xiaoqiang Zhu, and Kun Gai. 2019. Lifelong Sequential Modeling with Personalized Memorization for User Response Prediction. In Proceedings of the 42nd International ACM SIGIR Conference on Research and Development in Information Retrieval (SIGIR ’19). ACM...

  43. [45]

    Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. 2024. From Isolated Conversations to Hierarchical Schemas: Dynamic Tree Memory Representation for LLMs. arXiv preprint arXiv:2410.14052 (2024)

  44. [46]

    Alaa Saleh, Sasu Tarkoma, Anders Lindgren, Praveen Kumar Donta, Schahram Dustdar, Susanna Pirttikangas, and Lauri Lovén. 2025. MemIndex: Agentic Event-based Distributed Memory Management for Multi-agent Systems. ACM Trans. Auton. Adapt. Syst. (Nov. 2025). doi:10.1145/3774946 Just Accepted

  45. [47]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D Manning. 2024. RAPTOR: Recursive Abstractive Processing for Tree-Organized Retrieval. In The Twelfth International Conference on Learning Representations

  46. [48]

    Yaorui Shi, Yuxin Chen, Siyuan Wang, Sihang Li, Hengxing Cai, Qi Gu, Xiang Wang, and An Zhang. 2026. Look Back to Reason Forward: Revisitable Memory for Long-Context LLM Agents. arXiv:2509.23040 [cs.CL] https://arxiv.org/abs/2509.23040

  47. [49]

    Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh Murthy, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, Silvio Savarese, Huan Wang, Caiming Xiong, and Shelby Heinecke. 2025. PersonaBench: Evaluating AI Models on Understanding Personal Information through Accessing (Synthetic) Private User Data. arXiv:2502.20616 [cs...

  48. [50]

    Juntao Tan, Liangwei Yang, Zuxin Liu, Zhiwei Liu, Rithesh RN, Tulika Manoj Awalgaonkar, Jianguo Zhang, Weiran Yao, Ming Zhu, Shirley Kokane, et al. 2025. Personabench: Evaluating ai models on understanding personal information through accessing (synthetic) private user data. In Findings of the Association for Computational Linguistics: ACL 2025. 878–893

  49. [51]

    Dawei Tao, Enqi Liu, Sidath Randeni Kadupitige, Michael Cahill, Alan Fekete, and Uwe Röhm. 2024. First Past the Post: Evaluating Query Optimization in MongoDB. arXiv:2409.16544 [cs.DB] https://arxiv.org/abs/2409.16544

  50. [52]

    Mohammad Tavakoli, Alireza Salemi, Carrie Ye, Mohamed Abdalla, Hamed Zamani, and J Ross Mitchell. 2026. Beyond a Million Tokens: Benchmarking and Enhancing Long-Term Memory in LLMs. arXiv:2510.27246 [cs.CL] https://arxiv.org/abs/2510.27246

  51. [53]

    Kimi Team, Tongtong Bai, Yifan Bai, et al. 2026. Kimi K2.5: Visual Agentic Intelligence. arXiv:2602.02276 [cs.CL] https://arxiv.org/abs/2602.02276

  52. [54]

    Haoye Tian, Chong Wang, BoYang Yang, Lyuye Zhang, and Yang Liu. 2025. A Taxonomy of Prompt Defects in LLM Systems. arXiv:2509.14404 [cs.SE] https://arxiv.org/abs/2509.14404

  53. [55]

    Jianguo Wang, Xiaomeng Yi, Rentong Guo, et al. 2021. Milvus: A Purpose-Built Vector Data Management System. In Proceedings of the 2021 International Conference on Management of Data (Virtual Event, China) (SIGMOD ’21). Association for Computing Machinery, New York, NY, USA, 2614–2627. doi:10.1145/3448016.3457550

  54. [56]

    Shuting Wang, Xin Yu, Mang Wang, Weipeng Chen, Yutao Zhu, and Zhicheng Dou. 2025. RichRAG: Crafting Rich Responses for Multi-faceted Queries in Retrieval-Augmented Generation. In COLING. 11317–11333

  55. [58]

    Yu Wang and Xi Chen. 2025. MIRIX: Multi-Agent Memory System for LLM-Based Agents. arXiv:2507.07957 [cs.CL] https://arxiv.org/abs/2507.07957

  56. [59]

    Zhaowei Wang, Wenhao Yu, Xiyu Ren, Jipeng Zhang, Yu Zhao, Rohit Saxena, Liang Cheng, Ginny Wong, Simon See, Pasquale Minervini, Yangqiu Song, and Mark Steedman. 2025. MMLongBench: Benchmarking Long-Context Vision-Language Models Effectively and Thoroughly. arXiv:2505.10610 [cs.CV] https://arxiv.org/abs/2505.10610

  57. [60]

    Zora Zhiruo Wang, Jiayuan Mao, Daniel Fried, and Graham Neubig. 2025. Agent Workflow Memory. In Forty-second International Conference on Machine Learning. https://openreview.net/forum?id=NTAhi2JEEE

  58. [61]

    Peter West and Christopher Potts. 2025. Base Models Beat Aligned Models at Randomness and Creativity. In Second Conference on Language Modeling. https://openreview.net/forum?id=vqN8uom4A1

  59. [62]

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. 2025. LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory. arXiv:2410.10813 [cs.CL] https://arxiv.org/abs/2410.10813

  60. [63]

    Tingyu Wu, Zhisheng Chen, Ziyan Weng, Shuhe Wang, Chenglong Li, Shuo Zhang, Sen Hu, Silin Wu, Qizhen Lan, Huacan Wang, and Ronghao Chen

  61. [64]

    KnowMe-Bench: Benchmarking Person Understanding for Lifelong Digital Companions. arXiv:2601.04745 [cs.AI] https://arxiv.org/abs/2601.04745

  62. [65]

    Derong Xu, Pengyue Jia, Xiaopeng Li, Yingyi Zhang, Maolin Wang, Qidong Liu, Xiangyu Zhao, Yichao Wang, Huifeng Guo, Ruiming Tang, et al

  63. [66]

    Align-GRAG: Reasoning-Guided Dual Alignment for Graph Retrieval-Augmented Generation. arXiv preprint arXiv:2505.16237 (2025)

  64. [67]

    Derong Xu, Xinhang Li, Ziheng Zhang, Zhenxi Lin, Zhihong Zhu, Zhi Zheng, Xian Wu, Xiangyu Zhao, Tong Xu, and Enhong Chen. 2025. Harnessing Large Language Models for Knowledge Graph Question Answering via Adaptive Multi-Aspect Retrieval-Augmentation. arXiv:2412.18537 [cs.CL] https://arxiv.org/abs/2412.18537

  65. [68]

    Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, Wenlin Zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, and Tong Xu

  66. [69]

    From Single to Multi-Granularity: Toward Long-Term Memory Association and Selection of Conversational Agents. arXiv:2505.19549 [cs.CL] https://arxiv.org/abs/2505.19549

  67. [70]

    Derong Xu, Ziheng Zhang, Zhenxi Lin, Xian Wu, Zhihong Zhu, Tong Xu, Xiangyu Zhao, Yefeng Zheng, and Enhong Chen. 2024. Multi-perspective Improvement of Knowledge Graph Completion with Large Language Models. In LREC/COLING

  68. [71]

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. 2025. A-MEM: Agentic Memory for LLM Agents. arXiv:2502.12110 [cs.CL] https://arxiv.org/abs/2502.12110

  69. [72]

    Yue Xu, Qian Chen, Zizhan Ma, Dongrui Liu, Wenxuan Wang, Xiting Wang, Li Xiong, and Wenjie Wang. 2026. Toward Personalized LLM-Powered Agents: Foundations, Evaluation, and Future Directions. arXiv:2602.22680 [cs.AI] https://arxiv.org/abs/2602.22680

  70. [73]

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. 2026. Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning. arXiv:2508.19828 [cs.CL] https://arxiv.org/abs/2508.19828

  71. [74]

    An Yang, Anfeng Li, Baosong Yang, et al. 2025. Qwen3 Technical Report. arXiv:2505.09388 [cs.CL] https://arxiv.org/abs/2505.09388

  72. [75]

    An Yang, Baosong Yang, Beichen Zhang, Binyuan Hui, Bo Zheng, Bowen Yu, Chengyuan Li, Dayiheng Liu, Fei Huang, Haoran Wei, et al. 2024. Qwen2.5 Technical Report. arXiv e-prints (2024), arXiv–2412

  73. [76]

    Howard Yen, Tianyu Gao, Minmin Hou, Ke Ding, Daniel Fleischer, Peter Izsak, Moshe Wasserblat, and Danqi Chen. 2025. HELMET: How to Evaluate Long-Context Language Models Effectively and Thoroughly. arXiv:2410.02694 [cs.CL] https://arxiv.org/abs/2410.02694

  74. [77]

    Hongli Yu, Tinghong Chen, Jiangtao Feng, Jiangjie Chen, Weinan Dai, Qiying Yu, Ya-Qin Zhang, Wei-Ying Ma, Jingjing Liu, Mingxuan Wang, et al

  75. [78]

    MemAgent: Reshaping Long-Context LLM with Multi-Conv RL-based Memory Agent. arXiv preprint arXiv:2507.02259 (2025)

  76. [79]

    Qianhao Yuan, Jie Lou, Zichao Li, Jiawei Chen, Yaojie Lu, Hongyu Lin, Le Sun, Debing Zhang, and Xianpei Han. 2025. MemSearcher: Training LLMs to Reason, Search and Manage Memory via End-to-End Reinforcement Learning. arXiv preprint arXiv:2511.02805 (2025)

  77. [80]

    Zhenrui Yue, Honglei Zhuang, Aijun Bai, Kai Hui, Rolf Jagerman, Hansi Zeng, Zhen Qin, Dong Wang, Xuanhui Wang, and Michael Bendersky. [n. d.]. Inference Scaling for Long-Context Retrieval Augmented Generation. In ICLR

  78. [81]

    Chao Zhang, Yuhao Wang, Derong Xu, Haoxin Zhang, Yuanjie Lyu, Yuhao Chen, Shuochen Liu, Tong Xu, Xiangyu Zhao, Yan Gao, et al. 2025. TeaRAG: A Token-Efficient Agentic Retrieval-Augmented Generation Framework. arXiv preprint arXiv:2511.05385 (2025)

  79. [82]

    Xiaotian Zhang, Yuan Wang, Ruizhe Chen, Zeya Wang, Runchen Hou, and Zuozhu Liu. 2025. Towards Proactive Personalization through Profile Customization for Individual Users in Dialogues. arXiv:2512.15302 [cs.CL] https://arxiv.org/abs/2512.15302

  80. [83]

    Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, and Jitao Sang. 2025. Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks. arXiv preprint arXiv:2510.12635 (2025)

Showing first 80 references.