GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

Evgeniy Gabrilovich; Jingbo Yang; Kwei-Herng Lai; Shiyu Chang; Xiaowen Wang; Yaar Harari

arxiv: 2605.14498 · v2 · pith:PFSVRTVMnew · submitted 2026-05-14 · 💻 cs.CL

Jingbo Yang, Kwei-Herng Lai, Xiaowen Wang, Shiyu Chang, Yaar Harari, Evgeniy Gabrilovich

Pith reviewed 2026-05-20 21:26 UTC · model grok-4.3

classification 💻 cs.CL

keywords LLM agent memory · multi-party conversations · group dynamics · belief tracking · audience adaptation · adversarial query generation · benchmark evaluation

The pith

Multi-party conversations break current LLM agent memory systems because they erase speaker identities and audience adaptations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that existing memory systems for LLM agents were built only for single-user exchanges and therefore miss essential features of real group interactions. It identifies three unmeasured aspects: dynamics that arise only when multiple users converse with each other and the agent, separate tracking of what each participant believes, and shifts in wording that depend on who is being addressed. To expose these gaps the authors built a synthesis method that creates conversations from graphs while conditioning each message on individual personas and intended audiences, then paired it with an adversarial question generator that produces queries across six categories. The resulting benchmark shows that leading memory approaches lose the structural and lexical cues group memory requires, leaving performance far below what single-user tests suggested.

Core claim

GroupMemBench shows that LLM memory systems designed for dyadic chats lose the structural and lexical features required for multi-party settings: even the strongest system reaches only 46.0% average accuracy, while a simple BM25 term-matching baseline matches or exceeds most specialized agent memories.

What carries the argument

Graph-grounded synthesis pipeline that builds multi-party conversations with controllable reply structure, per-user personas, and target audiences, then binds each question to a specific asker through adversarial search.
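
To make the idea of a reply-structured conversation concrete, here is a minimal sketch of a message graph in which every message keeps its speaker, its reply target, and its intended audience. The names `Message`, `ConversationGraph`, and all of their fields are illustrative assumptions; they are not the paper's actual schema.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Optional, Tuple


@dataclass(frozen=True)
class Message:
    # Hypothetical fields chosen for illustration only.
    msg_id: str
    speaker: str                      # who wrote the message
    text: str
    reply_to: Optional[str] = None    # parent message id; None marks a thread root
    audience: Tuple[str, ...] = ()    # users the message is addressed to


@dataclass
class ConversationGraph:
    messages: Dict[str, Message] = field(default_factory=dict)
    children: Dict[str, List[str]] = field(default_factory=dict)

    def add(self, msg: Message) -> None:
        # A reply must point at a message that already exists.
        if msg.reply_to is not None and msg.reply_to not in self.messages:
            raise KeyError(f"unknown parent {msg.reply_to}")
        self.messages[msg.msg_id] = msg
        self.children.setdefault(msg.msg_id, [])
        if msg.reply_to is not None:
            self.children[msg.reply_to].append(msg.msg_id)

    def thread(self, msg_id: str) -> List[Message]:
        """Return the reply chain from the thread root down to msg_id."""
        chain: List[Message] = []
        cur: Optional[str] = msg_id
        while cur is not None:
            msg = self.messages[cur]
            chain.append(msg)
            cur = msg.reply_to
        return list(reversed(chain))

    def addressed_to(self, user: str) -> List[Message]:
        """Messages whose audience includes the given user, in insertion order."""
        return [m for m in self.messages.values() if user in m.audience]


# Toy conversation (invented data).
g = ConversationGraph()
g.add(Message("m1", "alice", "Budget is 10k.", None, ("bob", "carol")))
g.add(Message("m2", "bob", "I thought it was 12k.", "m1", ("alice",)))
g.add(Message("m3", "alice", "Updated: 12k is right.", "m2", ("bob",)))
```

Flattening this graph into a single text stream would discard `reply_to` and `audience`, which is the loss the argument above attributes to current memory ingestion.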

If this is right

Memory ingestion must retain speaker identity and reply structure rather than flattening conversations into a single stream.
Retrieval must support separate belief states for each participant instead of a shared pool.
Systems must adapt output vocabulary according to the audience present in the current exchange.
Knowledge-update and term-ambiguity tasks become the primary bottlenecks once group structure is present.
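
The second implication, separate belief states per participant, can be illustrated with a small store keyed by speaker. This is a sketch under stated assumptions: the class `SpeakerBeliefStore`, its method names, and the integer timestamps are hypothetical and do not come from the paper or from any of the benchmarked systems.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


class SpeakerBeliefStore:
    """Keeps, per speaker, the history of values asserted for each key."""

    def __init__(self) -> None:
        # speaker -> key -> list of (timestamp, value)
        self._beliefs: Dict[str, Dict[str, List[Tuple[int, str]]]] = defaultdict(
            lambda: defaultdict(list)
        )

    def observe(self, speaker: str, key: str, value: str, ts: int) -> None:
        self._beliefs[speaker][key].append((ts, value))

    def belief(self, speaker: str, key: str) -> Optional[str]:
        """Most recent value this speaker asserted, or None if they never did."""
        history = self._beliefs.get(speaker, {}).get(key)
        if not history:
            return None
        return max(history, key=lambda item: item[0])[1]

    def disagreements(self, key: str) -> Dict[str, str]:
        """Current per-speaker values for key, but only if speakers differ."""
        current = {s: self.belief(s, key) for s in self._beliefs}
        current = {s: v for s, v in current.items() if v is not None}
        return current if len(set(current.values())) > 1 else {}


# Toy data (invented): alice updates her view, bob does not.
store = SpeakerBeliefStore()
store.observe("alice", "deadline", "Friday", ts=1)
store.observe("bob", "deadline", "Friday", ts=2)
store.observe("alice", "deadline", "Monday", ts=3)
```

A shared pool would hold both "Friday" and "Monday" with no record of who currently believes which, so a knowledge-update question bound to a specific asker could not be answered from it.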

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Future agent designs may need an explicit conversation graph layer on top of current vector or key-value stores.
The same synthesis approach could be reused to create training data that teaches models to maintain per-user memory models.
Benchmarks that ignore audience adaptation may systematically overestimate readiness for collaborative workplace use.

Load-bearing premise

The generated conversations and questions accurately reflect the structure and demands of real multi-party interactions without artifacts that systematically favor or penalize particular memory designs.

What would settle it

Running the same memory systems on transcripts from actual deployed group-chat agents and measuring whether error patterns match the benchmark categories would test whether the synthetic data captures real failure modes.

Figures

Figures reproduced from arXiv: 2605.14498 by Evgeniy Gabrilovich, Jingbo Yang, Kwei-Herng Lai, Shiyu Chang, Xiaowen Wang, Yaar Harari.

**Figure 2.** Overview of the GroupMemBench data synthesis pipeline.

**Figure 3.** G-Eval scores across six dimensions. Our graph-guided synthesis (four domains) closely tracks the real-world upper bound and substantially outperforms the single-prompt baseline. Scores averaged over 10 seeds; shaded bands indicate ±1 std. Quality Assessment. We adapt G-Eval [31] to the group-chat setting and assess synthesized dialogues along six dimensions chosen to reflect properties specific to mult…

**Figure 4.** Performance–efficiency trade-off across the four domains. Each marker is one of six …

**Figure 5.** (Left) Failure-mode decomposition: each baseline’s 185 non-abstention questions per domain split into correct, reasoning failure, and retrieval failure. (Right) Retrieval recall vs. answer accuracy. Markers on the diagonal are retrieval-bottlenecked; below the diagonal indicates reasoning loss (gold surfaced but answered wrong); above indicates the system answered correctly without the gold message, typica…

**Figure 6.** P(correct | gold retrieved) per (baseline, question type). Factoring out retriever quality isolates each memory representation’s reasoning ability. Lexical shifts are the only failure that survives retrieval (Q3). Factoring out retriever quality (…
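
The quantities named in the Figure 5 and Figure 6 captions can be computed from two booleans per question. The sketch below assumes a hypothetical `Outcome` record and assumes that a wrong answer without the gold message counts as a retrieval failure; the paper's exact bookkeeping may differ.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List, Optional


@dataclass(frozen=True)
class Outcome:
    gold_retrieved: bool   # did the gold message appear in the retrieved context?
    correct: bool          # did the judge accept the final answer?


def decompose(outcomes: Iterable[Outcome]) -> Dict[str, int]:
    """Split questions into correct, reasoning failure, and retrieval failure."""
    counts = {"correct": 0, "reasoning_failure": 0, "retrieval_failure": 0}
    for o in outcomes:
        if o.correct:
            counts["correct"] += 1
        elif o.gold_retrieved:
            # Gold was surfaced but the answer was still wrong.
            counts["reasoning_failure"] += 1
        else:
            counts["retrieval_failure"] += 1
    return counts


def p_correct_given_retrieved(outcomes: Iterable[Outcome]) -> Optional[float]:
    """Estimate P(correct | gold retrieved); None if gold was never retrieved."""
    hits: List[Outcome] = [o for o in outcomes if o.gold_retrieved]
    if not hits:
        return None
    return sum(o.correct for o in hits) / len(hits)


# Toy outcomes (invented data).
outcomes = [
    Outcome(gold_retrieved=True, correct=True),
    Outcome(gold_retrieved=True, correct=False),
    Outcome(gold_retrieved=False, correct=False),
    Outcome(gold_retrieved=False, correct=True),
    Outcome(gold_retrieved=True, correct=True),
]
breakdown = decompose(outcomes)
conditional = p_correct_given_retrieved(outcomes)
```

Conditioning on `gold_retrieved` is what separates retriever quality from the reasoning ability of the memory representation.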

Original abstract

Large Language Model (LLM) agents increasingly serve as personal assistants and workplace collaborators, where their utility depends on memory systems that extract, retrieve, and apply information across long-running conversations. However, both existing memory systems and benchmarks are built around the dyadic, single-user setup, even though real deployments routinely span groups and channels with multiple users interacting with the agent and with each other. This mismatch leaves three properties of group memory unmeasured: (i) group dynamics that go beyond concatenated one-on-one chats, (ii) speaker-grounded belief tracking, where the per-user memory modeling is needed, and (iii) audience-adapted language, where Theory-of-Mind shifts produce role-specific vocabulary. We introduce GroupMemBench, a benchmark that exposes all three. A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences. An adversarial query pipeline then binds every question to a specific asker across six categories, spanning multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention, and iteratively searches challenging, realistic queries that reflect comprehensive memory capability. Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy, with knowledge update at 27.1% and term ambiguity at 37.7%, while a simple BM25 baseline matches or exceeds most agent memory systems. This indicates current memory ingestion erases the structural and lexical features group memory depends on, leaving multi-user memory far from solved.
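
For readers unfamiliar with the BM25 baseline mentioned in the abstract, the following is a minimal, self-contained sketch of Okapi BM25 scoring over raw messages. The tokenizer, the parameter values `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the choice to keep the speaker name inside each message string are all assumptions made for illustration; the paper's baseline configuration is not given here.

```python
import math
import re
from collections import Counter
from typing import Dict, List, Tuple


def tokenize(text: str) -> List[str]:
    return re.findall(r"\w+", text.lower())


class BM25:
    def __init__(self, docs: List[str], k1: float = 1.5, b: float = 0.75) -> None:
        self.k1, self.b = k1, b
        self.doc_tokens = [tokenize(d) for d in docs]
        self.doc_tf = [Counter(toks) for toks in self.doc_tokens]
        total_len = sum(len(toks) for toks in self.doc_tokens)
        # Fall back to 1.0 so length normalization never divides by zero.
        self.avgdl = total_len / len(docs) if docs and total_len else 1.0
        df: Counter = Counter()
        for toks in self.doc_tokens:
            df.update(set(toks))
        n = len(docs)
        # Smoothed IDF that stays non-negative for very common terms.
        self.idf: Dict[str, float] = {
            t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()
        }

    def score(self, query: str, idx: int) -> float:
        tf = self.doc_tf[idx]
        dl = len(self.doc_tokens[idx])
        norm = self.k1 * (1 - self.b + self.b * dl / self.avgdl)
        total = 0.0
        for term in tokenize(query):
            f = tf.get(term, 0)
            if f:
                total += self.idf[term] * f * (self.k1 + 1) / (f + norm)
        return total

    def top_k(self, query: str, k: int = 3) -> List[Tuple[int, float]]:
        scored = [(i, self.score(query, i)) for i in range(len(self.doc_tokens))]
        scored.sort(key=lambda pair: pair[1], reverse=True)
        return scored[:k]


# Toy messages (invented data); the speaker name stays in the indexed text.
messages = [
    "alice: the launch budget moved from 10k to 12k",
    "bob: can someone review the formatting rules for the report",
    "carol: lunch on friday works for me",
]
bm25 = BM25(messages)
query = "what is the launch budget"
best = bm25.top_k(query, k=1)
```

Because the messages are indexed verbatim, speaker names and original wording remain searchable, which is one plausible reading of why such a baseline stays competitive with systems that rewrite messages at ingestion.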

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

GroupMemBench shows current memory systems top out at 46% on multi-party chats with BM25 competitive, but the synthetic pipeline lacks real-data checks.

The main point is that LLM memory systems collapse on group conversations, with the best reaching only 46% average accuracy while a basic BM25 retriever keeps up with most of them. The paper does something new by targeting group-specific memory issues that dyadic benchmarks ignore. It builds conversations using a graph-grounded pipeline that incorporates per-user personas and target audiences for each message. Then it creates adversarial queries in six categories, including knowledge updates and term ambiguity, to probe speaker-grounded tracking and audience adaptation. This approach works well for showing how current memory methods erase the features that matter in multi-party settings. The breakdown by category gives concrete evidence of the shortfalls.

A potential issue is that the synthesis pipeline has not been checked against real group chat data. Without ratings for how natural the conversations feel or side-by-side comparisons to actual logs, it's possible the benchmark introduces biases that affect the results. The central claim about a sharp collapse might not hold as strongly on organic interactions.

Readers working on LLM agents for workplace or collaborative use will get the most from this. It provides a way to measure progress on memory that handles multiple users. The paper shows clear thinking in defining the unmeasured properties and building a controllable test. It should go to peer review so others can examine the query generation and synthesis details.

Referee Report

2 major / 2 minor

Summary. The paper introduces GroupMemBench to evaluate LLM agent memory in multi-party conversations, arguing that existing dyadic benchmarks miss group dynamics, speaker-grounded belief tracking, and audience-adapted language. It describes a graph-grounded synthesis pipeline that generates conversations conditioned on per-user personas and target audiences with controllable reply structures, followed by an adversarial query pipeline that produces questions across six categories (multi-hop reasoning, knowledge update, term ambiguity, user-implicit reasoning, temporal reasoning, and abstention). Benchmarking leading memory systems yields a maximum average accuracy of 46.0%, with particularly low scores on knowledge update (27.1%) and term ambiguity (37.7%); a simple BM25 baseline matches or exceeds most specialized systems, indicating that current memory ingestion erases structural and lexical features required for group memory.

Significance. If the synthetic benchmark faithfully captures real multi-party dynamics, the work would be significant for exposing a clear gap in LLM agent memory for collaborative settings and for supplying a reproducible evaluation framework with adversarial queries and a competitive non-LLM baseline. The concrete accuracy numbers and category breakdowns provide falsifiable targets that could guide future architecture design. The approach of binding queries to specific askers and using iterative search for challenging examples is a strength that distinguishes it from simpler concatenation-based evaluations.

major comments (2)

[§3] §3 (Graph-Grounded Synthesis Pipeline): No human ratings of realism, no side-by-side comparison with anonymized real group-chat corpora (e.g., Slack or Discord archives), and no quantitative checks on turn-taking regularity or persona explicitness are reported. Because the central claim—that current memory systems suffer a sharp collapse on group memory—rests on the benchmark measuring genuine multi-user capabilities rather than synthesis artifacts, this omission is load-bearing and requires either added validation experiments or explicit discussion of why such checks are unnecessary.
[Table 1 / §5] Table 1 (or equivalent results table) and §5 (Experiments): The per-category accuracies (e.g., 27.1% on knowledge update) are presented without error bars, statistical significance tests against the BM25 baseline, or ablation on query-generation hyperparameters. This weakens the strength of the conclusion that BM25 “matches or exceeds most agent memory systems,” as it is unclear whether observed differences are robust or sensitive to the adversarial search procedure.
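
One way to produce the error bars requested in the second major comment is a percentile bootstrap over per-question correctness. The sketch below is a generic illustration; the function name `bootstrap_ci`, the resample count, and the toy data are assumptions and are not taken from the paper.

```python
import random
from typing import List, Tuple


def bootstrap_ci(
    correct: List[int],
    n_resamples: int = 2000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap.

    `correct` holds one 0/1 entry per question.
    """
    if not correct:
        raise ValueError("need at least one question")
    rng = random.Random(seed)          # fixed seed keeps the sketch reproducible
    n = len(correct)
    point = sum(correct) / n
    means = []
    for _ in range(n_resamples):
        sample = rng.choices(correct, k=n)   # resample questions with replacement
        means.append(sum(sample) / n)
    means.sort()
    lower = means[int((alpha / 2) * n_resamples)]
    upper = means[min(int((1 - alpha / 2) * n_resamples), n_resamples - 1)]
    return point, lower, upper


# Toy data (invented): 30 correct answers out of 100 questions.
toy = [1] * 30 + [0] * 70
point, lower, upper = bootstrap_ci(toy)
```

Resampling at the question level treats questions as independent; if questions cluster by conversation or domain, resampling whole clusters would be the more conservative choice.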

minor comments (2)

[§4] The six query categories are listed in the abstract and §4 but would benefit from a short table or bullet list with one-sentence definitions and an example query for each to improve readability.
[Figures] Figure captions (e.g., the pipeline diagram) should explicitly state the number of conversations, messages, and queries generated so readers can assess scale without searching the text.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive comments, which help improve the clarity and rigor of our work. We address each major comment below and indicate the revisions we will make to the manuscript.

Referee: [§3] §3 (Graph-Grounded Synthesis Pipeline): No human ratings of realism, no side-by-side comparison with anonymized real group-chat corpora (e.g., Slack or Discord archives), and no quantitative checks on turn-taking regularity or persona explicitness are reported. Because the central claim—that current memory systems suffer a sharp collapse on group memory—rests on the benchmark measuring genuine multi-user capabilities rather than synthesis artifacts, this omission is load-bearing and requires either added validation experiments or explicit discussion of why such checks are unnecessary.

Authors: We agree that validating the realism of the synthesized conversations is crucial to substantiate our central claims. In the revised version, we will incorporate human evaluation results for a sample of the generated conversations, including ratings on aspects such as coherence, adherence to personas, and naturalness of group interactions. We will also provide quantitative comparisons of metrics like turn-taking patterns and persona explicitness against available real-world multi-party conversation datasets. This addition will address the concern directly.
Revision: yes

Referee: [Table 1 / §5] Table 1 (or equivalent results table) and §5 (Experiments): The per-category accuracies (e.g., 27.1% on knowledge update) are presented without error bars, statistical significance tests against the BM25 baseline, or ablation on query-generation hyperparameters. This weakens the strength of the conclusion that BM25 “matches or exceeds most agent memory systems,” as it is unclear whether observed differences are robust or sensitive to the adversarial search procedure.

Authors: We recognize that including statistical measures would enhance the robustness of our experimental results. Accordingly, in the revision we will add error bars to the accuracy figures in Table 1, conduct statistical significance tests (e.g., McNemar's test or t-tests) against the BM25 baseline, and include an ablation study on the hyperparameters of the adversarial query generation pipeline. These changes will provide stronger support for the observed performance differences.
Revision: yes
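
McNemar's test, which the rebuttal mentions, compares two systems on the same questions using only the questions where they disagree. Below is a minimal sketch of the exact (binomial) two-sided version; the function name `mcnemar_exact` and the toy vectors are illustrative assumptions, not the authors' analysis code.

```python
from math import comb
from typing import Sequence, Tuple


def mcnemar_exact(
    sys_correct: Sequence[int], base_correct: Sequence[int]
) -> Tuple[int, int, float]:
    """Exact two-sided McNemar test on paired 0/1 outcomes.

    Returns (b, c, p_value), where b counts questions only the system got
    right and c counts questions only the baseline got right.
    """
    if len(sys_correct) != len(base_correct):
        raise ValueError("outcome vectors must be paired")
    b = sum(1 for s, r in zip(sys_correct, base_correct) if s and not r)
    c = sum(1 for s, r in zip(sys_correct, base_correct) if r and not s)
    n = b + c
    if n == 0:
        return b, c, 1.0               # no disagreements, no evidence either way
    k = min(b, c)
    # Under the null, discordant pairs are Binomial(n, 0.5).
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)


# Toy paired outcomes (invented data) for a memory system and a baseline.
memory_system = [1, 1, 1, 1, 0, 0, 1, 0]
baseline = [1, 0, 0, 0, 0, 1, 0, 0]
b, c, p_value = mcnemar_exact(memory_system, baseline)
```

With 4 versus 1 discordant questions the p-value is 0.375, which shows how little a handful of disagreements can establish and why the referee asks for significance tests before concluding that BM25 matches or exceeds a system.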

Circularity Check

0 steps flagged

Empirical benchmark evaluation with no derivational circularity

The paper constructs GroupMemBench via a graph-grounded synthesis pipeline that generates multi-party conversations conditioned on per-user personas and target audiences, followed by an adversarial query pipeline across six categories. It then empirically evaluates existing memory systems on the resulting data, reporting concrete accuracies (e.g., 46.0% max, 27.1% on knowledge update) and direct comparison against an external BM25 baseline. No equations, parameter-fitting steps, self-citations, or uniqueness theorems appear in the provided text; the central claims rest on observable performance differences rather than any reduction of outputs to inputs by construction. The work is self-contained against external baselines and generated test cases.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The central claim rests on the assumption that the generated conversations and queries are representative of real group memory demands; no free parameters, axioms, or invented entities are described in the abstract.

pith-pipeline@v0.9.0 · 5833 in / 1075 out tokens · 47379 ms · 2026-05-20T21:26:33.940348+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: A graph-grounded synthesis pipeline produces multi-party conversations with controllable reply structure and conditions each message on per-user personas and target audiences.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel
Relation between the paper passage and the cited Recognition theorem: unclear
Paper passage: Benchmarking leading memory systems exposes a sharp collapse: the strongest one reaches only 46.0% average accuracy

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

47 extracted references · 47 canonical work pages · 11 internal anchors

[1] Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. OpenClaw-RL: Train any agent simply by talking. arXiv preprint arXiv:2603.10165, 2026.

[2] Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430, 2026.

[3] Sheryl Wei Ting Ng and Renwen Zhang. Trust in AI chatbots: A systematic review. Telematics and Informatics, 97:102240, 2025.

[4] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal LLM agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024.

[5] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

[6] Jess Stratton. An introduction to Microsoft Copilot. In Copilot for Microsoft 365: harness the power of generative AI in the Microsoft apps you use every day, pages 19–35. Springer, 2024.

[7] Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, et al. Rethinking memory mechanisms of foundation agents in the second half. arXiv preprint arXiv:2602.06052, 2026.

[8] Chris Latimer, Nicoló Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava, Xuan Wang, and Naren Ramakrishnan. Hindsight is 20/20: Building agent memory that retains, recalls, and reflects. arXiv preprint arXiv:2512.12818, 2025.

[9] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready AI agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.

[10] Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. arXiv preprint arXiv:2602.16313, 2026.

[11] Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, et al. Ama-bench: Evaluating long-horizon memory for agentic applications. arXiv preprint arXiv:2602.22769, 2026.

[12] Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. MemoryBench: A benchmark for memory and continual learning in LLM systems. arXiv preprint arXiv:2510.17281, 2025.

[13] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[14] Chris Frith and Uta Frith. Theory of mind. Current Biology, 15(17):R644–R645, 2005.

[15] Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. Fantom: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023.

[16] Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Saki Mizuno, Keita Suzuki, Ryo Masumura, Hiroaki Sugiyama, and Kuniko Saito. Tomato: Verbalizing the mental states of role-playing LLMs for benchmarking theory of mind. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 1520–1528, 2025.

[17] Herbert H Clark and Susan E Brennan. Grounding in communication. 1991.

[18] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. Advances in Neural Information Processing Systems, 33:9459–9474, 2020.

[19] Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. Advances in Neural Information Processing Systems, 37:59532–59569, 2024.

[20] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.

[21] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in LLM agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025.

[22] Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Dannong Xu, Yi Bai, Tianwei Lin, Xinda Zhao, Xiaohong Li, Jiaqi An, et al. Evermembench: Benchmarking long-term interactive memory in large language models. arXiv preprint arXiv:2602.01313, 2026.

[23] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 19724–19731, 2024.

[24] Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025.

[25] Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, et al. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, 2025.

[26] Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688, 2025.

[27] Dongming Jiang, Yi Li, Guanpeng Li, and Bingzhe Li. MAGMA: A multi-graph based agentic memory architecture for AI agents. arXiv preprint arXiv:2601.03236, 2026.

[28] Charles Packer, Vivian Fang, Shishir G. Patil, Kevin Lin, Sarah Wooders, and Joseph E. Gonzalez. Memgpt: Towards LLMs as operating systems. 2023.

[29] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-MEM: Agentic memory for LLM agents. arXiv preprint arXiv:2502.12110, 2025.

[30] Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of LLM-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025.

[31] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, 2023.

[32] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

Appendix A Graph Schema. Node types and attributes. The synthesis graph G contains four s...
[33] If the suffix contains any of "incorrect", "wrong", or "not correct", the verdict is Incorrect.

[34] Otherwise, if the suffix contains "correct", the verdict is Correct.

[35] Otherwise, the verdict is recorded as Unclear and excluded from the accuracy denominator. Negative phrases are checked first because the positive trigger "correct" is a substring of the negative phrases "incorrect" and "not correct"; the implementation is in eval_lib.py (lines 146–152). Reliability check. We manually re-examined 100 (question, gold answer, predicted answer, judge verdict) tuples sampled from ...

Who do I need aligned on formatting rules for the mitigation plan in the Risk: Formatting Inconsistencies phase?

[37] User_7 / Data Analyst / Risk: Formatting Inconsistencies

[38] User_13 / Compliance Officer / 2025-07-19 (Msg_1545)

[39] User_13 / Compliance Officer / 2025-07-23 (Msg_28294) ← answer here
Agent answer: “Finance and Data Engineering.”
Why it works: the gpt-5 agent reads the full top-10 context and surfaces the correct phrasing from rank 7. The pipeline survives because nothing was rewritten; it just relied on a longer effective window than BM25 did.
hindsight ✓ Correct (LLM-rewr...

[40] Author: User_7 / Data Analyst / phase=Risk: Formatting Inconsistencies (early)

[41] Author: User_7 / Data Analyst / phase=Risk: Formatting Inconsistencies (mid)

[42] Author: User_7 / Data Analyst / phase=Risk: Formatting Inconsistencies (late)
What was lost. Speaker identity isn’t physically erased (Author: User_7 is in every retrieved memory) but it has been ignored at retrieval time: similarity search returned three near-duplicate posts about the same topic from a single louder speaker, and shadowed User_13’s actual ...
Agent answer: (empty). The agent declines to answer because none of the retrieved User_7 posts name a counterparty for User_13.

[43] User_12 (Compliance Officer): “...Please weigh in from Finance and Data Engineering ...”

[44] User_4 (IT Systems Lead): “...cross-functional review: Finance, Data Engineering, QA, and template owners...”

[45] User_12 (Compliance Officer): “...I need Finance and Engineering to confirm ...”
What was lost. The right entities (Finance, Data Engineering) are in the retrieval, but they are scattered across three other users’ requests, each with slightly different counterparty lists. The agent unions the candidate sets rather than honoring the asker’s specific request. Ag...

[46] User_7 (Data Analyst, early phase)

[47] User_7 (Data Analyst, late phase)

[48] User_4 (IT Systems Lead, late phase)
Agent answer: “Finance, Operations, Reporting Owners, Compliance, Data Engineering, QA, and Template Owners.”
Why it fails. Same root cause as hipporag: the asker’s specific request was never retrieved, so the agent assembled a “who-has-ever-been-mentioned” list. The two correct names (Finance, Data Engineering) are in th...
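
The suffix-based verdict rule described in items [33] through [35] can be sketched in a few lines. This is an illustrative reconstruction, not the contents of eval_lib.py: the function names `parse_verdict` and `accuracy` are hypothetical, and lowercasing the suffix before matching is an added assumption.

```python
from typing import List, Optional

# Negative triggers are checked first because the positive trigger "correct"
# is a substring of both "incorrect" and "not correct".
NEGATIVE_TRIGGERS = ("incorrect", "wrong", "not correct")


def parse_verdict(suffix: str) -> str:
    """Map the tail of a judge response to Correct, Incorrect, or Unclear."""
    s = suffix.lower()  # assumption: matching is case-insensitive
    if any(neg in s for neg in NEGATIVE_TRIGGERS):
        return "Incorrect"
    if "correct" in s:
        return "Correct"
    return "Unclear"


def accuracy(suffixes: List[str]) -> Optional[float]:
    """Accuracy over decided verdicts; Unclear is excluded from the denominator."""
    verdicts = [parse_verdict(s) for s in suffixes]
    decided = [v for v in verdicts if v != "Unclear"]
    if not decided:
        return None
    return decided.count("Correct") / len(decided)


# Toy judge outputs (invented data).
samples = ["Verdict: correct", "This is incorrect.", "no idea", "Correct"]
score = accuracy(samples)
```

Swapping the order of the two checks would label every "incorrect" response as Correct, which is the failure the ordering note in item [35] guards against.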

[1] [1]

OpenClaw-RL: Train Any Agent Simply by Talking

Yinjie Wang, Xuyang Chen, Xiaolong Jin, Mengdi Wang, and Ling Yang. Openclaw-rl: Train any agent simply by talking.arXiv preprint arXiv:2603.10165, 2026

work page internal anchor Pith review Pith/arXiv arXiv 2026

[2] Renjun Xu and Yang Yan. Agent skills for large language models: Architecture, acquisition, security, and the path forward. arXiv preprint arXiv:2602.12430, 2026.

[3] Sheryl Wei Ting Ng and Renwen Zhang. Trust in ai chatbots: A systematic review. Telematics and Informatics, 97:102240, 2025.

[4] Yuanchun Li, Hao Wen, Weijun Wang, Xiangyu Li, Yizhen Yuan, Guohong Liu, Jiacheng Liu, Wenxing Xu, Xiang Wang, Yi Sun, et al. Personal llm agents: Insights and survey about the capability, efficiency and security. arXiv preprint arXiv:2401.05459, 2024.

[5] Lei Wang, Chen Ma, Xueyang Feng, Zeyu Zhang, Hao Yang, Jingsen Zhang, Zhiyuan Chen, Jiakai Tang, Xu Chen, Yankai Lin, et al. A survey on large language model based autonomous agents. Frontiers of Computer Science, 18(6):186345, 2024.

[6] Jess Stratton. An introduction to microsoft copilot. In Copilot for Microsoft 365: harness the power of generative AI in the Microsoft apps you use every day, pages 19–35. Springer, 2024.

[7] Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, et al. Rethinking memory mechanisms of foundation agents in the second half. arXiv preprint arXiv:2602.06052, 2026.

[8] Chris Latimer, Nicoló Boschi, Andrew Neeser, Chris Bartholomew, Gaurav Srivastava, Xuan Wang, and Naren Ramakrishnan. Hindsight is 20/20: Building agent memory that retains, recalls, and reflects. arXiv preprint arXiv:2512.12818, 2025.

[9] Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory. arXiv preprint arXiv:2504.19413, 2025.

[10] Zexue He, Yu Wang, Churan Zhi, Yuanzhe Hu, Tzu-Ping Chen, Lang Yin, Ze Chen, Tong Arthur Wu, Siru Ouyang, Zihan Wang, et al. Memoryarena: Benchmarking agent memory in interdependent multi-session agentic tasks. arXiv preprint arXiv:2602.16313, 2026.

[11] Yujie Zhao, Boqin Yuan, Junbo Huang, Haocheng Yuan, Zhongming Yu, Haozhou Xu, Lanxiang Hu, Abhilash Shankarampeta, Zimeng Huang, Wentao Ni, et al. Ama-bench: Evaluating long-horizon memory for agentic applications. arXiv preprint arXiv:2602.22769, 2026.

[12] Qingyao Ai, Yichen Tang, Changyue Wang, Jianming Long, Weihang Su, and Yiqun Liu. Memorybench: A benchmark for memory and continual learning in llm systems. arXiv preprint arXiv:2510.17281, 2025.

[13] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Longmemeval: Benchmarking chat assistants on long-term interactive memory. arXiv preprint arXiv:2410.10813, 2024.

[14] Chris Frith and Uta Frith. Theory of mind. Current biology, 15(17):R644–R645, 2005.

[15] Hyunwoo Kim, Melanie Sclar, Xuhui Zhou, Ronan Bras, Gunhee Kim, Yejin Choi, and Maarten Sap. Fantom: A benchmark for stress-testing machine theory of mind in interactions. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 14397–14413, 2023.

[16] Kazutoshi Shinoda, Nobukatsu Hojo, Kyosuke Nishida, Saki Mizuno, Keita Suzuki, Ryo Masumura, Hiroaki Sugiyama, and Kuniko Saito. Tomato: Verbalizing the mental states of role-playing llms for benchmarking theory of mind. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 39, pages 1520–1528, 2025.

[17] Herbert H Clark and Susan E Brennan. Grounding in communication. 1991.

[18] Patrick Lewis, Ethan Perez, Aleksandra Piktus, Fabio Petroni, Vladimir Karpukhin, Naman Goyal, Heinrich Küttler, Mike Lewis, Wen-tau Yih, Tim Rocktäschel, et al. Retrieval-augmented generation for knowledge-intensive nlp tasks. Advances in neural information processing systems, 33:9459–9474, 2020.

[19] Bernal J Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. Advances in neural information processing systems, 37:59532–59569, 2024.

[20] Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of llm agents. In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, 2024.

[21] Yuanzhe Hu, Yu Wang, and Julian McAuley. Evaluating memory in llm agents via incremental multi-turn interactions. arXiv preprint arXiv:2507.05257, 2025.

[22] Chuanrui Hu, Tong Li, Xingze Gao, Hongda Chen, Dannong Xu, Yi Bai, Tianwei Lin, Xinda Zhao, Xiaohong Li, Jiaqi An, et al. Evermembench: Benchmarking long-term interactive memory in large language models. arXiv preprint arXiv:2602.01313, 2026.

[23] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Proceedings of the AAAI conference on artificial intelligence, volume 38, pages 19724–19731, 2024.

[24] Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):1–47, 2025.

[25] Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, et al. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. In Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 8416–8439, 2025.

[26] Bowen Jiang, Yuan Yuan, Maohao Shen, Zhuoqun Hao, Zhangchen Xu, Zichen Chen, Ziyi Liu, Anvesh Rao Vijjini, Jiashu He, Hanchao Yu, et al. Personamem-v2: Towards personalized intelligence via learning implicit user personas and agentic memory. arXiv preprint arXiv:2512.06688, 2025.

[27] Dongming Jiang, Yi Li, Guanpeng Li, and Bingzhe Li. Magma: A multi-graph based agentic memory architecture for ai agents. arXiv preprint arXiv:2601.03236, 2026.

[28] Charles Packer, Vivian Fang, Shishir G Patil, Kevin Lin, Sarah Wooders, and Joseph E Gonzalez. Memgpt: towards llms as operating systems. 2023.

[29] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for llm agents. arXiv preprint arXiv:2502.12110, 2025.

[30] Haoran Tan, Zeyu Zhang, Chen Ma, Xu Chen, Quanyu Dai, and Zhenhua Dong. Membench: Towards more comprehensive evaluation on the memory of llm-based agents. In Findings of the Association for Computational Linguistics: ACL 2025, pages 19336–19352, 2025.

[31] Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-eval: Nlg evaluation using gpt-4 with better human alignment. In Proceedings of the 2023 conference on empirical methods in natural language processing, 2023.

[32] Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph rag approach to query-focused summarization. arXiv preprint arXiv:2404.16130, 2024.

Appendix A Graph Schema

Node types and attributes. The synthesis graph G contains four s...

1. If the suffix contains any of “incorrect”, “wrong”, or “not correct”, the verdict is Incorrect.

2. Otherwise, if the suffix contains “correct”, the verdict is Correct.

3. Otherwise, the verdict is recorded as Unclear and excluded from the accuracy denominator.

Negative phrases are checked first because the positive trigger “correct” is a substring of the negative phrases “incorrect” and “not correct”; the implementation is in eval_lib.py (lines 146–152).

Reliability check. We manually re-examined 100 (question, gold answer, predicted answer, judge verdict) tuples sampled from ...

Who do I need aligned on formatting rules for the mitigation plan in the Risk: Formatting Inconsistencies phase?
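The suffix-matching verdict rule stated above (negative phrases first, then the positive trigger, otherwise Unclear) can be sketched in a few lines of Python. This is an illustrative reconstruction, not the code in eval_lib.py: the function names `parse_verdict` and `accuracy`, the constant names, and the case-insensitive matching are all assumptions.

```python
# Hypothetical sketch of the verdict rule; not the authors' eval_lib.py.
NEGATIVE_TRIGGERS = ("incorrect", "wrong", "not correct")
POSITIVE_TRIGGER = "correct"


def parse_verdict(suffix):
    """Map the judge's verdict suffix to Correct, Incorrect, or Unclear."""
    text = suffix.lower()  # assumption: matching ignores case
    # Negative phrases must be tested first: "correct" is a substring of
    # both "incorrect" and "not correct", so the positive test alone
    # would misfire on them.
    if any(trigger in text for trigger in NEGATIVE_TRIGGERS):
        return "Incorrect"
    if POSITIVE_TRIGGER in text:
        return "Correct"
    return "Unclear"


def accuracy(verdicts):
    """Accuracy over scored verdicts; Unclear is dropped from the denominator."""
    scored = [v for v in verdicts if v != "Unclear"]
    if not scored:
        return None  # nothing to score
    return sum(v == "Correct" for v in scored) / len(scored)


if __name__ == "__main__":
    suffixes = ["Verdict: correct", "This is not correct", "cannot decide"]
    verdicts = [parse_verdict(s) for s in suffixes]
    print(verdicts, accuracy(verdicts))
```

Reversing the two checks would label every “incorrect” suffix as Correct, which is why the order matters more than the exact trigger list.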

User_7 / Data Analyst / Risk: Formatting Inconsistencies

User_13 / Compliance Officer / 2025-07-19 (Msg_1545)

User_13 / Compliance Officer / 2025-07-23 (Msg_28294) ← answer here

Agent answer: “Finance and Data Engineering.”

Why it works: the gpt-5 agent reads the full top-10 context and surfaces the correct phrasing from rank 7. The pipeline survives because nothing was rewritten; it just relied on a longer effective window than BM25 did.

hindsight ✓ Correct (LLM-rewr...

Author: User_7 / Data Analyst / phase=Risk: Formatting Inconsistencies (early)

Author: User_7 / Data Analyst / phase=Risk: Formatting Inconsistencies (mid)

Author: User_7 / Data Analyst / phase=Risk: Formatting Inconsistencies (late)

What was lost. Speaker identity isn’t physically erased (Author: User_7 is in every retrieved memory), but it has been ignored at retrieval time: similarity search returned three near-duplicate posts about the same topic from a single louder speaker, and shadowed User_13’s actual ...

Agent answer: (empty). The agent declines to answer because none of the retrieved User_7 posts name a counterparty for User_13.

User_12 (Compliance Officer): “...Please weigh in from Finance and Data Engineering ...”

User_4 (IT Systems Lead): “...cross-functional review: Finance, Data Engineering, QA, and template owners...”

User_12 (Compliance Officer): “...I need Finance and Engineering to confirm ...”

What was lost. The right entities (Finance, Data Engineering) are in the retrieval, but they are scattered across three other users’ requests, each with slightly different counterparty lists. The agent unions the candidate sets rather than honoring the asker’s specific request. Ag...

User_7 (Data Analyst, early phase)

User_7 (Data Analyst, late phase)

User_4 (IT Systems Lead, late phase)

Agent answer: “Finance, Operations, Reporting Owners, Compliance, Data Engineering, QA, and Template Owners.”

Why it fails. Same root cause as hipporag: the asker’s specific request was never retrieved, so the agent assembled a “who-has-ever-been-mentioned” list. The two correct names (Finance, Data Engineering) are in th...
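The union failure described in these cases can be illustrated with a small Python sketch. Everything here is hypothetical and not part of the benchmark code: the `Memory` record, the two helper functions, and the hand-parsed counterparty tuples are invented for illustration, and the sketch assumes the asker is User_13, as the case study suggests.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass(frozen=True)
class Memory:
    """Hypothetical retrieved memory: who wrote it and whom they asked for."""
    author: str
    counterparties: Tuple[str, ...]


def union_answer(memories: List[Memory]) -> List[str]:
    """Failure mode: merge every counterparty from every retrieved memory."""
    merged: List[str] = []
    for memory in memories:
        for name in memory.counterparties:
            if name not in merged:
                merged.append(name)
    return merged


def asker_scoped_answer(memories: List[Memory], asker: str) -> Optional[List[str]]:
    """Use only memories authored by the asker; None if none were retrieved."""
    own = [m for m in memories if m.author == asker]
    if not own:
        return None  # abstain rather than borrow other speakers' requests
    return list(own[-1].counterparties)  # most recent request by the asker


# Counterparty lists paraphrased from the three retrieved requests quoted above.
retrieved = [
    Memory("User_12", ("Finance", "Data Engineering")),
    Memory("User_4", ("Finance", "Data Engineering", "QA", "Template Owners")),
    Memory("User_12", ("Finance", "Engineering")),
]

if __name__ == "__main__":
    print(union_answer(retrieved))
    print(asker_scoped_answer(retrieved, "User_13"))
    with_asker = retrieved + [Memory("User_13", ("Finance", "Data Engineering"))]
    print(asker_scoped_answer(with_asker, "User_13"))
```

The union version returns an over-complete list whenever other speakers' requests are retrieved, while the asker-scoped version can only answer if retrieval surfaces the asker's own message, which is the step the failing systems missed.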