arXiv: 2605.25693 · v1 · pith:LWQUSG3Ynew · submitted 2026-05-25 · 💻 cs.CL · cs.DB · cs.MA

From Facts to Insights: A Persona-Driven Dual Memory Framework and Dataset for Role-Playing Agents

Pith reviewed 2026-06-29 21:49 UTC · model grok-4.3

classification 💻 cs.CL · cs.DB · cs.MA
keywords role-playing agents · memory frameworks · persona fidelity · dual memory · RoleMemo dataset · long-term conversations · supervised fine-tuning · reinforcement learning

The pith

Dual memory separates facts from persona insights so small models sustain role-play consistency better than large fact-only systems.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Role-playing agents lose character consistency over long conversations when memory systems store facts without interpreting them through the persona. The paper builds RoleMemo, a dataset of four tasks requiring persona-guided reasoning from factual fragments, to expose failures in standard persona-agnostic summarization. It introduces DualMem to keep two separate memory streams—one factual and one that generates persona-conditioned insights. After SFT and RL training, a 4B-parameter model using DualMem achieves higher sustained persona fidelity than zero-shot persona-agnostic systems based on DeepSeek-V3.2.

Core claim

DualMem decouples memory into factual cognition and persona-conditioned insight streams. Trained with supervised fine-tuning followed by reinforcement learning, this structure lets a 4B-parameter model outperform zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 on sustained persona fidelity, as measured by the RoleMemo dataset where facts must be interpreted through the persona to answer correctly.

What carries the argument

DualMem, the framework that maintains two decoupled memory streams—one recording facts and the other deriving persona-specific insights from those facts.
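
As a rough illustration of what two decoupled but linked streams could look like, here is a minimal Python sketch. It is not the authors' implementation: the class and method names (`DualMemoryBank`, `add_fact`, `add_insight`) are hypothetical, the example facts are invented, and the insight text is passed in by hand where the paper would have a trained model generate it.

```python
from dataclasses import dataclass, field


@dataclass
class Fact:
    fact_id: int
    text: str  # neutral record of what happened


@dataclass
class Insight:
    insight_id: int
    text: str  # persona-conditioned reading of one or more facts
    source_fact_ids: list[int]  # links back to the facts it interprets


@dataclass
class DualMemoryBank:
    """Hypothetical container: two separate streams that stay linked by id."""

    persona: str
    facts: dict[int, Fact] = field(default_factory=dict)
    insights: dict[int, Insight] = field(default_factory=dict)

    def add_fact(self, text: str) -> int:
        fact_id = len(self.facts)
        self.facts[fact_id] = Fact(fact_id, text)
        return fact_id

    def add_insight(self, text: str, source_fact_ids: list[int]) -> int:
        # An insight must point at facts that are already stored.
        missing = [i for i in source_fact_ids if i not in self.facts]
        if missing:
            raise KeyError(f"unknown fact ids: {missing}")
        insight_id = len(self.insights)
        self.insights[insight_id] = Insight(insight_id, text, list(source_fact_ids))
        return insight_id

    def facts_for_insight(self, insight_id: int) -> list[Fact]:
        return [self.facts[i] for i in self.insights[insight_id].source_fact_ids]

    def insights_for_fact(self, fact_id: int) -> list[Insight]:
        return [i for i in self.insights.values() if fact_id in i.source_fact_ids]


# Toy usage with invented content (assumption: not taken from RoleMemo).
bank = DualMemoryBank(persona="public-interest lawyer")
f0 = bank.add_fact("The user's landlord raised the rent twice this year.")
f1 = bank.add_fact("The user received no written notice.")
i0 = bank.add_insight("This may be a tenant-rights issue worth raising.", [f0, f1])
```

Keeping the link from each insight to its source facts means a retrieved insight can always be traced back to the neutral record it interprets, which is the property the Figure 2 caption describes as "preserving their links".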

If this is right

  • Role-playing agents equipped with dual memory streams produce fewer generic responses that break character.
  • A 4B model can exceed the persona fidelity of much larger models when memory includes explicit persona-conditioned interpretation.
  • The RoleMemo tasks serve as a targeted benchmark exposing limitations of fact-only memory systems.
  • Reinforcement learning after SFT further strengthens the model's ability to generate persona-appropriate insights from stored facts.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The dual-stream design may apply to other long-context personalization tasks where raw facts must be filtered through a fixed identity or style.
  • Real-world deployment would benefit from testing whether the RoleMemo advantage persists when conversations include user-driven topic shifts not present in the dataset.
  • Scaling the same dual-memory structure to larger base models could compound the fidelity gains observed with the 4B version.

Load-bearing premise

Performance on the four reasoning tasks in RoleMemo accurately reflects an agent's ability to maintain persona fidelity across unrestricted open-ended conversations.

What would settle it

Extended human-AI role-play sessions lasting dozens of turns where evaluators directly score persona consistency and check whether DualMem still shows an advantage over persona-agnostic baselines.
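
One way to analyze such sessions is to check whether benchmark accuracy and human-scored persona consistency rank systems the same way. The sketch below computes a Spearman rank correlation in plain Python. The numbers are invented placeholders, not results from the paper, and the variable names are assumptions.

```python
def average_ranks(values):
    """Return 1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the window over a run of equal values.
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        mean_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = mean_rank
        pos = end + 1
    return ranks


def pearson(xs, ys):
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


def spearman(xs, ys):
    """Spearman rank correlation: Pearson correlation of the rank vectors."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two equal-length sequences with at least 2 items")
    return pearson(average_ranks(xs), average_ranks(ys))


# Placeholder values, one entry per hypothetical system (NOT from the paper).
benchmark_accuracy = [0.42, 0.55, 0.61, 0.70, 0.78]
human_consistency = [2.1, 2.9, 2.7, 3.8, 4.2]
rho = spearman(benchmark_accuracy, human_consistency)
```

A high `rho` over real systems would support reading RoleMemo accuracy as a proxy for open-ended persona fidelity; a low one would undercut it. With only a handful of systems the estimate is noisy, so any real analysis would also need a significance test.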

Figures

Figures reproduced from arXiv: 2605.25693 by Jiji Tang, Junnan Ren, Rongsheng Zhang, Ruofan Hu, Tangjie Lv, Tao Jin, Wanying Wu, Weijie Chen, Xunuoyan Chen, Zhou Zhao.

Figure 1: Persona-agnostic memory stores the neutral…

Figure 2: The RoleMemo dataset and DUALMEM framework. Left: The four-phase dataset construction pipeline. We generate queries that require agents to synthesize scattered facts via persona-driven insights to formulate answers. Right: Our proposed DUALMEM framework. It maintains a unified Memory Bank that decouples factual and insight cognition while preserving their links. We explore two training mechanisms on RoleMe…

Figure 3: Performance across conversation lengths from…

Figure 4: Example persona (Public-Interest Law), trans…

Original abstract

While role-playing agents excel in short-term interactions, long-term conversations overwhelm context windows, motivating external memory frameworks. Current systems typically rely on persona-agnostic summarization, which records facts without persona-specific interpretation, yielding generic responses that compromise persona fidelity. To bridge this gap, we introduce RoleMemo, a dataset featuring four reasoning tasks where the factual fragments must be interpreted through the persona to reach the correct answer. Evaluation on RoleMemo exposes critical limitations of persona-agnostic frameworks. We thus propose DualMem, which decouples memory into two streams: factual cognition and persona-conditioned insight. Trained through Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL), our framework with a 4B-parameter model outperforms zero-shot persona-agnostic frameworks powered by DeepSeek-V3.2 for sustained persona fidelity. Our resources are available at https://github.com/role2026/rolememo.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces RoleMemo, a dataset of four reasoning tasks that require interpreting factual fragments through a given persona to produce correct answers, to expose limitations of persona-agnostic memory frameworks in role-playing agents. It proposes DualMem, which decouples memory into factual cognition and persona-conditioned insight streams, trained via SFT and RL. The central empirical claim is that a 4B-parameter DualMem model outperforms zero-shot persona-agnostic baselines powered by DeepSeek-V3.2 on sustained persona fidelity as measured by RoleMemo accuracy.

Significance. If the result holds and RoleMemo is shown to be a valid proxy, the work would offer a concrete dataset and dual-stream architecture for mitigating persona drift in long conversations, addressing a practical limitation of context-window-based role-playing systems. The combination of SFT+RL training and the explicit separation of memory streams represents a targeted engineering contribution, though its broader impact hinges on validation against open-ended dialogue.

major comments (2)
  1. [Evaluation / Abstract] Evaluation section (and abstract claim): The assertion that DualMem 'outperforms ... for sustained persona fidelity' rests on accuracy gains on RoleMemo's four tasks. No correlation is reported between RoleMemo accuracy and human-judged persona consistency across multi-turn open-ended dialogues (e.g., emotional valence drift or spontaneous action consistency), which is the motivating failure mode. This leaves the central performance claim unsupported for the intended use case.
  2. [RoleMemo dataset description] § on RoleMemo construction: The four tasks are presented as a proxy for persona-conditioned inference, yet the manuscript provides no ablation or human study demonstrating that success on these short factual-interpretation items predicts fidelity in unconstrained long conversations. If the tasks only capture narrow reasoning and not the broader failure modes, the comparison to DeepSeek-V3.2 does not establish the claimed advantage.
minor comments (2)
  1. [Experimental setup] The GitHub link is provided but the manuscript does not specify the exact train/test splits, prompt templates, or statistical significance tests used in the RoleMemo experiments.
  2. [DualMem framework] Notation for the two memory streams (factual cognition vs. persona-conditioned insight) is introduced without an accompanying diagram or pseudocode, making the architectural distinction harder to follow.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments regarding the evaluation claims and the RoleMemo dataset. We address each major point below, acknowledging where the manuscript can be strengthened through clarification and revision.

  1. Referee: [Evaluation / Abstract] Evaluation section (and abstract claim): The assertion that DualMem 'outperforms ... for sustained persona fidelity' rests on accuracy gains on RoleMemo's four tasks. No correlation is reported between RoleMemo accuracy and human-judged persona consistency across multi-turn open-ended dialogues (e.g., emotional valence drift or spontaneous action consistency), which is the motivating failure mode. This leaves the central performance claim unsupported for the intended use case.

    Authors: We agree that the manuscript does not report a direct correlation study between RoleMemo accuracy and human judgments of persona consistency in unconstrained multi-turn dialogues. RoleMemo was explicitly designed to isolate the persona-conditioned interpretation of factual fragments—the precise mechanism that produces drift when persona-agnostic memory is used—rather than to simulate full open-ended conversations. The four tasks require models to derive persona-specific insights that persona-agnostic baselines systematically miss, providing a controlled test of the core limitation. To address the concern, we will revise the abstract and evaluation section to state that DualMem outperforms baselines on RoleMemo as a proxy measure for sustained persona fidelity, and we will add an explicit discussion of the proxy's scope and limitations. revision: yes

  2. Referee: [RoleMemo dataset description] § on RoleMemo construction: The four tasks are presented as a proxy for persona-conditioned inference, yet the manuscript provides no ablation or human study demonstrating that success on these short factual-interpretation items predicts fidelity in unconstrained long conversations. If the tasks only capture narrow reasoning and not the broader failure modes, the comparison to DeepSeek-V3.2 does not establish the claimed advantage.

    Authors: The RoleMemo tasks were constructed so that each factual fragment yields a different correct answer depending on the persona; success therefore requires the model to perform persona-conditioned inference rather than generic fact recall. This directly targets the failure mode of persona-agnostic memory frameworks. While the manuscript does not include an ablation or human study linking short-task performance to long open-ended dialogues, the task design ensures that persona-agnostic systems fail by construction. We will expand the dataset description section with additional construction details and examples to better justify why these tasks serve as a meaningful proxy. A comprehensive human validation against unconstrained dialogues would require further experiments beyond the current work. revision: partial
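
The "fail by construction" argument in the response above can be made concrete with a toy example. The sketch below is an illustration only: the item format, persona names, and answers are invented and are not drawn from RoleMemo. It shows that an answerer which ignores the persona is capped at 1/k accuracy on an item whose k personas all have distinct gold answers.

```python
from dataclasses import dataclass


@dataclass
class ToyItem:
    fragments: tuple[str, ...]  # shared factual fragments
    question: str
    answer_by_persona: dict[str, str]  # gold answer depends on the persona


# Invented items (assumption: not the RoleMemo schema or content).
ITEMS = [
    ToyItem(
        fragments=("The user skipped lunch to finish a report.",
                   "The user's manager praised the report."),
        question="What should the agent suggest next?",
        answer_by_persona={"strict coach": "push_harder",
                           "caring friend": "take_a_break"},
    ),
    ToyItem(
        fragments=("The user bought an expensive guitar.",
                   "The user has a loan payment due next week."),
        question="How should the agent react?",
        answer_by_persona={"frugal accountant": "question_purchase",
                           "fellow musician": "celebrate_purchase"},
    ),
]


def persona_agnostic(item: ToyItem, persona: str) -> str:
    # Ignores the persona entirely; picks one fixed answer per item.
    return next(iter(item.answer_by_persona.values()))


def persona_aware_oracle(item: ToyItem, persona: str) -> str:
    # Upper bound: always returns the gold answer for the given persona.
    return item.answer_by_persona[persona]


def accuracy(answer_fn, items) -> float:
    correct = total = 0
    for item in items:
        for persona, gold in item.answer_by_persona.items():
            correct += answer_fn(item, persona) == gold
            total += 1
    return correct / total


agnostic_acc = accuracy(persona_agnostic, ITEMS)
oracle_acc = accuracy(persona_aware_oracle, ITEMS)
```

The toy only demonstrates the ceiling on a persona-blind answerer under this item design. It says nothing about whether such items predict fidelity in open-ended dialogue, which is the point the referee raises.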

Circularity Check

0 steps flagged

No circularity: purely empirical claims with no derivations

The paper introduces RoleMemo (a dataset of four reasoning tasks) and DualMem (a dual-stream memory framework trained with SFT+RL) and reports empirical accuracy gains of a 4B model over DeepSeek-V3.2 baselines. No equations, first-principles derivations, parameter-fitting steps, or uniqueness theorems appear in the abstract or description. The performance claim rests on direct task accuracy rather than any reduction to fitted inputs or self-referential definitions, making the work self-contained as an empirical contribution.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract provides insufficient detail to enumerate specific free parameters or axioms; the performance claim implicitly rests on the unstated assumption that SFT+RL training on the new dataset produces generalizable persona fidelity gains.
