PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
Pith reviewed 2026-05-21 09:50 UTC · model grok-4.3
The pith
Advanced memory systems improve preference extraction by connecting related interactions, but they still struggle to keep personas consistent over time and across domains.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
PERMA is a benchmark that evaluates long-term memory in personalized agents using sequences of temporally ordered interaction events that span multiple sessions and domains. Preference-related queries are inserted over time, and text variability and linguistic alignment are added to simulate the erratic inputs and individual idiolects of real users. Experiments indicate that memory systems which link related interactions extract precise preferences more effectively, and with lower token consumption, than traditional semantic retrieval over raw dialogues. These systems still have difficulty maintaining a coherent persona across temporal depth and cross-domain interference.
What carries the argument
The PERMA benchmark's event-driven setup, which organizes interactions into temporally ordered events spanning sessions and domains with inserted preference queries to test gradual preference accumulation and persona consistency.
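As a rough illustration of the event-driven setup described above, the sketch below models a timeline of interaction events with a preference query inserted at a later point. It is not taken from the PERMA code release: the names `Event`, `Query`, and `visible_history`, the field names, and the example utterances are all hypothetical, chosen only to show how a query placed on the timeline limits a model to the events that precede it.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Event:
    """One interaction event; every field name here is hypothetical."""
    timestamp: int   # position on the interaction timeline
    session_id: int  # session the event belongs to
    domain: str      # topic area, e.g. "food" or "travel"
    text: str        # the user utterance


@dataclass
class Query:
    """A preference-related probe inserted at some point on the timeline."""
    timestamp: int
    question: str


def visible_history(events: List[Event], query: Query) -> List[Event]:
    """Return the events that precede the query, in temporal order."""
    earlier = [e for e in events if e.timestamp < query.timestamp]
    return sorted(earlier, key=lambda e: e.timestamp)


# Toy timeline (invented for illustration), deliberately stored out of order.
events = [
    Event(3, 2, "travel", "Booked a window seat again."),
    Event(1, 1, "food", "I usually skip dairy."),
    Event(5, 3, "food", "Oat milk latte, please."),
]
query = Query(4, "Which seat would this user pick?")
history = visible_history(events, query)
print([(e.timestamp, e.domain) for e in history])
```

In this toy setup the event at timestamp 5 is hidden from the query at timestamp 4, so a memory system can only draw on what was said earlier, across both sessions and both domains.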
Load-bearing premise
The constructed sequence of temporally ordered events with inserted queries and added variability truly captures how user preferences evolve gradually in real, noisy, multi-domain conversations.
What would settle it
A study comparing agent performance on PERMA tasks with performance on actual long-term user interactions, where preference changes are tracked, would test whether the benchmark's findings hold in practice. If event linking shows no advantage on real user data, the claim weakens.
Original abstract
Empowering large language models with long-term memory is crucial for building agents that adapt to users' evolving needs. Existing evaluations of this capability typically interleave preference-related dialogues with irrelevant conversations, reducing the task to needle-in-a-haystack retrieval while ignoring relationships between events driving user preference evolution. Such settings overlook a fundamental characteristic of real-world personalization: preferences emerge gradually and accumulate across interactions within noisy contexts. To bridge this gap, we introduce PERMA, a benchmark designed to evaluate persona consistency over time beyond static preference recall. Additionally, we incorporate (1) text variability and (2) linguistic alignment to simulate erratic user inputs and individual idiolects in real-world data. PERMA consists of temporally ordered interaction events spanning multiple sessions and domains, with preference-related queries inserted over time. We design both multiple-choice and interactive tasks to probe the model's understanding of persona along the interaction timeline. Experiments demonstrate that by linking related interactions, advanced memory systems extract precise preferences and reduce token consumption, outperforming traditional semantic retrieval of raw dialogues. Nevertheless, they still struggle to maintain a coherent persona across temporal depth and cross-domain interference, highlighting the need for more robust personalized memory management in agents. Our code and data are open-sourced at https://github.com/PolarisLiu1/PERMA.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces PERMA, a benchmark for evaluating long-term personalized memory in LLMs. It consists of temporally ordered multi-session, multi-domain interaction events with inserted preference-related queries, text variability, and linguistic alignment to better simulate gradual preference evolution in noisy real-world contexts. The authors design multiple-choice and interactive tasks to test persona understanding over time and compare memory systems that link related interactions against traditional semantic retrieval of raw dialogues. Experiments claim that linking-based systems extract more precise preferences, reduce token consumption, and outperform semantic retrieval, yet all systems still struggle to maintain coherent personas across temporal depth and cross-domain interference.
Significance. If the synthetic event sequences faithfully proxy gradual, implicit preference accumulation, the work usefully identifies limitations in current memory architectures for personalization and supplies an open benchmark (with code and data released at the cited GitHub repository) for future progress. The emphasis on event linking and token efficiency is a concrete, testable contribution.
major comments (1)
- [§3] §3 (PERMA Construction): The insertion of explicit preference-related queries into the temporally ordered events, combined with added text variability, risks generating artificially clean and detectable signals rather than the erratic, implicit, and gradual preference drift characteristic of real user data. This construction choice is load-bearing for the central claims about both the reported gains of linking-based memory and the reported failures in persona coherence; without additional validation (e.g., comparison to real user logs or human judgment of implicitness), the benchmark may not support the generalization that current systems 'still struggle' in realistic settings.
minor comments (2)
- [Abstract] Abstract and §4: No quantitative results, error bars, or statistical significance tests are reported for the preference-extraction or persona-coherence metrics, making it difficult to judge the practical magnitude of the claimed improvements over semantic retrieval.
- [§4.2] §4.2: The exact definitions and scoring rubrics for the interactive tasks (e.g., how persona coherence is measured across sessions) should be stated more formally, perhaps with an equation or pseudocode, to allow exact reproduction.
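One way the uncertainty requested in the first minor comment could be reported is a paired percentile bootstrap over per-item scores. The sketch below is a generic illustration, not the authors' evaluation code: the function name `paired_bootstrap_ci`, its parameters, and the toy score lists are all assumptions, and the numbers are made up rather than drawn from the paper.

```python
import random
from typing import List, Tuple


def paired_bootstrap_ci(
    scores_a: List[float],
    scores_b: List[float],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Mean difference (a - b) with a percentile bootstrap interval."""
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")
    rng = random.Random(seed)
    # Pair the systems item by item so that item difficulty cancels out.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    n = len(diffs)
    means = []
    for _ in range(n_resamples):
        # Resample items with replacement and record the mean difference.
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(diffs) / n, lo, hi


# Made-up per-item correctness (1 = correct); illustration only.
linking = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1]
retrieval = [1, 0, 0, 1, 0, 1, 0, 1, 1, 0]
mean_diff, low, high = paired_bootstrap_ci(linking, retrieval)
print(f"mean difference = {mean_diff:.2f}, 95% CI = [{low:.2f}, {high:.2f}]")
```

If the resulting interval excludes zero, the gap between the two systems is unlikely to be a resampling artifact. With only a handful of items, as in this toy input, the interval will be wide.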
Simulated Author's Rebuttal
We thank the referee for their constructive and detailed feedback on our manuscript. We address the major comment below and indicate the revisions made in response.
Point-by-point responses
- Referee: [§3] §3 (PERMA Construction): The insertion of explicit preference-related queries into the temporally ordered events, combined with added text variability, risks generating artificially clean and detectable signals rather than the erratic, implicit, and gradual preference drift characteristic of real user data. This construction choice is load-bearing for the central claims about both the reported gains of linking-based memory and the reported failures in persona coherence; without additional validation (e.g., comparison to real user logs or human judgment of implicitness), the benchmark may not support the generalization that current systems 'still struggle' in realistic settings.
  Authors: We appreciate the referee's point regarding the balance between controlled construction and real-world fidelity in PERMA. The explicit insertion of preference-related queries at varying temporal positions is intentional to capture gradual preference accumulation across sessions and domains, a feature absent from prior needle-in-a-haystack evaluations. Text variability and linguistic alignment were added precisely to model erratic inputs and idiolects, as stated in Section 3. We acknowledge that synthetic benchmarks cannot fully replicate the implicitness of private user logs. In the revised manuscript we have expanded Section 3 with additional justification of these design decisions, clarified their relation to observed real-world personalization challenges, and included a human evaluation study (new Appendix) in which annotators rate the implicitness and realism of the generated events. These changes provide further support for the benchmark while preserving its utility for isolating memory mechanisms.
  revision: partial
Circularity Check
No significant circularity in benchmark construction or empirical claims
The paper introduces PERMA as an explicitly constructed benchmark with temporally ordered events, inserted preference queries, text variability, and linguistic alignment to address limitations in prior evaluations. Experimental results report direct comparisons between linking-based memory systems and semantic retrieval on these tasks, without any reduction of outcomes to fitted parameters, self-defined quantities, or load-bearing self-citations. The design choices are presented as independent methodological decisions to simulate gradual preference evolution, and the reported gains/failures are empirical observations on the defined tasks rather than derivations equivalent to the inputs by construction. This is self-contained against external benchmarks and matches the default expectation of no circularity.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Preferences emerge gradually and accumulate across interactions within noisy contexts.
invented entities (1)
- PERMA benchmark: no independent evidence
Forward citations
Cited by 3 Pith papers
- π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
  π-Bench is a new evaluation suite that jointly measures proactivity and task completion in AI agents across sustained multi-turn workflows containing hidden intents and cross-session continuity.
- π-Bench: Evaluating Proactive Personal Assistant Agents in Long-Horizon Workflows
  π-Bench is a new benchmark for evaluating proactive personal assistant agents on 100 multi-turn tasks that include hidden intents, inter-task dependencies, and cross-session continuity.
- Memory-Augmented LLM-based Multi-Agent System for Automated Feature Generation on Tabular Data
  MALMAS is a memory-augmented multi-agent LLM system that generates diverse, high-quality features for tabular data via agent decomposition, routing, and iterative memory-guided refinement.