
arxiv: 2410.10813 · v2 · submitted 2024-10-14 · 💻 cs.CL

Recognition: 3 theorem links

· Lean Theorem

LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory


Pith reviewed 2026-05-11 16:35 UTC · model grok-4.3

classification 💻 cs.CL
keywords long-term memory · chat assistants · LLM benchmarks · memory systems · conversational AI · multi-session reasoning · temporal reasoning

The pith

Chat assistants lose 30 percent accuracy on long-term memory benchmark

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents LongMemEval, a benchmark to assess chat assistants' ability to maintain memory over extended, multi-session interactions. It tests five specific abilities: extracting information, reasoning across sessions, temporal reasoning, handling knowledge updates, and knowing when to abstain from answering. Commercial systems and long-context models struggle, showing roughly a 30 percent accuracy decline compared to shorter contexts. The authors outline a three-stage memory framework and propose optimizations in indexing and retrieval that lead to better recall and answering performance on the benchmark. This work highlights the gap between current capabilities and what reliable, ongoing conversational AI requires.

Core claim

LongMemEval introduces a benchmark of 500 questions set in freely scalable user-assistant chat histories to evaluate five core long-term memory abilities. It reveals that existing commercial chat assistants and long-context LLMs suffer a 30% accuracy drop in memorizing information across sustained interactions. A unified framework breaking long-term memory into indexing, retrieval, and reading stages, enhanced by optimizations such as session decomposition, fact-augmented key expansion, and time-aware query expansion, substantially improves memory recall and downstream question answering.

What carries the argument

LongMemEval benchmark testing five memory abilities via 500 questions in chat histories, together with a three-stage memory design (indexing, retrieval, reading) and associated optimizations for granularity, indexing, and search scope.
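
The three stages named above can be illustrated with a small sketch. This is not the authors' implementation (their released code is the reference); it is a minimal Python illustration that assumes token overlap as a stand-in for a real retriever, a hypothetical `facts_for_round` callback in place of LLM-based fact extraction, and an explicit date range in place of a time range inferred from the query. All names (`MemoryItem`, `index`, `retrieve`, `read`) are invented for this example.

```python
from dataclasses import dataclass, field
from datetime import date
import re


def tokens(text: str) -> set[str]:
    """Lowercased word tokens; a stand-in for a real retriever's encoder."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class MemoryItem:
    value: str   # text handed to the reader (one user-assistant round)
    when: date   # session date, used to narrow the search scope
    key: set[str] = field(default_factory=set)  # tokens used for matching


def index(sessions, facts_for_round):
    """Indexing: split sessions into rounds, then expand each key with facts."""
    items = []
    for when, rounds in sessions:
        for text in rounds:                     # session decomposition
            key = tokens(text)
            for fact in facts_for_round(text):  # fact-augmented key expansion
                key |= tokens(fact)
            items.append(MemoryItem(value=text, when=when, key=key))
    return items


def retrieve(items, query, start=None, end=None, k=2):
    """Retrieval: optionally filter by date range, then rank by token overlap."""
    q = tokens(query)
    pool = [m for m in items
            if (start is None or m.when >= start)
            and (end is None or m.when <= end)]
    ranked = sorted(pool, key=lambda m: len(q & m.key), reverse=True)
    return ranked[:k]


def read(query, retrieved):
    """Reading: build the prompt a reader LLM would receive (no model call)."""
    context = "\n".join(f"[{m.when.isoformat()}] {m.value}" for m in retrieved)
    return f"History:\n{context}\n\nQuestion: {query}"


# Toy history; the content is invented purely for illustration.
sessions = [
    (date(2024, 3, 1),
     ["User: I adopted a beagle named Milo. Assistant: Congratulations!"]),
    (date(2024, 5, 9),
     ["User: We moved to Denver last week. Assistant: How is the new place?"]),
]


def toy_facts(text):
    # Hypothetical hand-written facts; a real system would extract them.
    if "beagle" in text:
        return ["the user owns a dog"]
    return ["the user lives in Denver"]


memory = index(sessions, toy_facts)
hits = retrieve(memory, "What dog does the user own?", k=1)
print(read("What dog does the user own?", hits))
```

In this toy history the query never mentions "beagle", so the added fact is what lets the first round outrank the second; with an empty fact list the Denver round would score higher on raw token overlap. Passing `start` or `end` restricts the candidate pool by session date before ranking.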

If this is right

  • The proposed optimizations can be applied to improve memory performance in LLM-based chat systems.
  • Explicit session management and time awareness help overcome limitations of pure long-context approaches.
  • Enhanced memory recall directly boosts the accuracy of answers to user questions in long interactions.
  • The benchmark allows testing memory capabilities at scale without constraining interaction length.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Pure reliance on longer context windows may not suffice for interactive memory needs.
  • These design principles could enable chat assistants to handle months-long conversations more effectively.
  • Benchmarks like this could be extended to evaluate memory in other AI applications such as personal agents.

Load-bearing premise

That the five core abilities and the 500 curated questions comprehensively capture the long-term memory requirements of real sustained user-assistant interactions.

What would settle it

Conducting real-world user studies over multiple sessions and checking whether LongMemEval scores correlate with observed memory lapses in actual deployments.
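
One way such a check could be scored is a rank correlation between per-system benchmark accuracy and the lapse rate observed in deployment. The sketch below computes Spearman's rho from scratch in Python. The function names and the four input pairs are made-up placeholders for illustration, not measurements from the paper or from any deployment.

```python
def ranks(values):
    """1-based ranks; tied values share the mean of their positions."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over the run of values equal to values[order[i]].
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = mean_rank
        i = j + 1
    return result


def spearman(x, y):
    """Spearman's rho: the Pearson correlation of the two rank vectors."""
    if len(x) != len(y) or len(x) < 2:
        raise ValueError("need two sequences of equal length >= 2")
    rx, ry = ranks(x), ranks(y)
    n = len(rx)
    mx, my = sum(rx) / n, sum(ry) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(rx, ry))
    var_x = sum((a - mx) ** 2 for a in rx)
    var_y = sum((b - my) ** 2 for b in ry)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for constant input")
    return cov / (var_x * var_y) ** 0.5


# Placeholder values only: one entry per hypothetical system.
benchmark_scores = [0.42, 0.55, 0.61, 0.70]
lapse_rates = [0.30, 0.22, 0.25, 0.10]
rho = spearman(benchmark_scores, lapse_rates)
print(round(rho, 3))
```

A strongly negative rho on real data would indicate that systems scoring higher on the benchmark also show fewer lapses in use; a value near zero would suggest the benchmark is not tracking deployed behavior.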

Original abstract

Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading. Built upon key experimental insights, we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope. Extensive experiments show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI. Our benchmark and code are publicly available at https://github.com/xiaowu0162/LongMemEval.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper introduces LongMemEval, a benchmark with 500 curated questions testing five core long-term memory abilities in chat assistants (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention). It reports that commercial chat assistants and long-context LLMs exhibit a ~30% accuracy drop on sustained interactions, then proposes a three-stage memory framework (indexing, retrieval, reading) with optimizations including session decomposition, fact-augmented key expansion, and time-aware query expansion. Experiments show these optimizations improve recall and QA performance, and the benchmark plus code are released publicly.

Significance. If the benchmark's representativeness holds, the work supplies a needed evaluation resource for long-term memory in conversational AI and practical design guidance via the staged framework and optimizations. The public release of the benchmark and code is a clear strength that supports reproducibility and community follow-up.

major comments (2)
  1. [Abstract and §3 (Benchmark Construction)] The central claim that LongMemEval 'presents a significant challenge' with a 30% accuracy drop rests on the 500 questions comprehensively capturing real sustained interactions, yet the curation process provides no inter-annotator agreement scores, coverage/saturation metrics, or comparison to actual user logs. This directly affects whether the observed drop and optimization gains generalize beyond the benchmark.
  2. [§4 (Experiments) and associated tables] The 30% drop and subsequent gains from the three optimizations are reported without details on baseline implementations (e.g., exact prompting or retrieval setups for commercial systems), error analysis by ability type, or statistical significance testing, making it difficult to verify robustness of the headline empirical results.
minor comments (2)
  1. [§2] The five abilities are introduced without explicit justification or citation to prior work on memory taxonomies; adding a short related-work paragraph would clarify novelty.
  2. [Figures in §4] Figure captions and axis labels in the experimental plots could be expanded to include exact metric definitions and confidence intervals for easier interpretation.
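
Major comment 1 asks for inter-annotator agreement. For two annotators who each assign one categorical label per question, a common statistic is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The Python sketch below is a generic implementation; the function name and the example labels are hypothetical and do not come from the paper's curation data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labelling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two non-empty label lists of equal length")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)


# Hypothetical ability labels for six questions.
annotator_a = ["extraction", "temporal", "update",
               "abstain", "extraction", "temporal"]
annotator_b = ["extraction", "temporal", "update",
               "abstain", "extraction", "update"]
kappa = cohens_kappa(annotator_a, annotator_b)
print(round(kappa, 3))
```

Here the annotators agree on five of six items (observed agreement 5/6) while chance agreement is 1/4, so kappa works out to 7/9, noticeably lower than the raw agreement rate.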

Simulated Author's Rebuttal

2 responses · 1 unresolved

We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly where feasible to improve clarity and rigor.

Point-by-point responses
  1. Referee: [Abstract and §3 (Benchmark Construction)] The central claim that LongMemEval 'presents a significant challenge' with a 30% accuracy drop rests on the 500 questions comprehensively capturing real sustained interactions, yet the curation process provides no inter-annotator agreement scores, coverage/saturation metrics, or comparison to actual user logs. This directly affects whether the observed drop and optimization gains generalize beyond the benchmark.

    Authors: We acknowledge the importance of demonstrating the benchmark's representativeness. The questions were developed by a small team of researchers following explicit guidelines to target the five memory abilities across varying interaction lengths and complexities. In the revised manuscript, we have expanded Section 3 with a fuller description of the curation protocol, added inter-annotator agreement scores computed on a held-out sample of questions (Cohen's kappa > 0.8), and included coverage metrics that quantify distribution across session counts and reasoning depths. We cannot, however, supply a direct comparison against proprietary user logs from commercial platforms. We maintain that the controlled, ability-focused design still yields a meaningful challenge, as the consistent performance degradation across diverse systems supports the headline findings. revision: partial

  2. Referee: [§4 (Experiments) and associated tables] The 30% drop and subsequent gains from the three optimizations are reported without details on baseline implementations (e.g., exact prompting or retrieval setups for commercial systems), error analysis by ability type, or statistical significance testing, making it difficult to verify robustness of the headline empirical results.

    Authors: We agree that greater transparency is required. The revised version adds an appendix that documents the precise prompting templates, API parameters, and retrieval configurations employed for each commercial chat assistant and long-context LLM baseline. We have also inserted a new error analysis subsection that decomposes accuracy by the five ability categories, revealing the largest drops in multi-session reasoning and knowledge updates. Finally, we report paired statistical significance tests (with p-values) on the accuracy differences between original and optimized memory systems, confirming that the reported gains are statistically reliable. revision: yes
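
The rebuttal does not say which paired test is used. When two memory systems are scored correct or incorrect on the same questions, one standard option is the exact McNemar test, which looks only at the questions where the two systems disagree. The Python sketch below assumes that setup; the function name and the example outcomes are hypothetical, not results from the paper.

```python
from math import comb


def mcnemar_exact(base_correct, opt_correct):
    """Two-sided exact McNemar p-value for paired correct/incorrect outcomes."""
    if len(base_correct) != len(opt_correct):
        raise ValueError("both systems must be scored on the same questions")
    pairs = list(zip(base_correct, opt_correct))
    b = sum(x and not y for x, y in pairs)  # only the baseline is correct
    c = sum(y and not x for x, y in pairs)  # only the optimized system is correct
    n = b + c
    if n == 0:
        return 1.0  # the systems never disagree
    # Under the null, each discordant pair favors either system with p = 0.5.
    k = min(b, c)
    tail = sum(comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Hypothetical outcomes on 12 questions: the optimized system fixes 8
# baseline errors and breaks none.
baseline = [False] * 8 + [True] * 4
optimized = [True] * 8 + [True] * 4
p_value = mcnemar_exact(baseline, optimized)
print(p_value)
```

With 8 discordant questions all favoring the optimized system, the two-sided p-value is 2 / 2^8 = 0.0078125. Questions that both systems get right, or both get wrong, do not affect the result.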

standing simulated objections not resolved
  • Direct comparison of benchmark questions to proprietary real-world user logs from commercial chat systems

Circularity Check

0 steps flagged

No circularity: benchmark creation and empirical optimizations are self-contained.

Full rationale

The paper introduces LongMemEval as a curated benchmark for five memory abilities and reports empirical results showing performance drops and gains from three-stage optimizations (indexing, retrieval, reading). No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on manual curation of 500 questions and experimental measurements rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. The skeptic's concern about benchmark coverage is a validity issue, not a circularity issue under the enumerated patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claims rest on the domain assumption that the five listed abilities form a complete set for long-term memory evaluation; no free parameters or invented entities are introduced in the abstract.

axioms (1)
  • domain assumption: The five abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention) represent the core long-term memory capabilities of chat assistants.
    Explicitly stated as the basis for the benchmark design in the abstract.



Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

  • Foundation.HierarchyEmergence hierarchy_emergence_forces_phi unclear

    Relation between the paper passage and the cited Recognition theorem.

    we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope.

  • Foundation.LawOfExistence defect_zero_iff_one unclear

    Relation between the paper passage and the cited Recognition theorem.

    commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 37 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations

    cs.CL 2026-05 conditional novelty 8.0

    GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.

  2. MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare

    cs.AI 2026-05 conditional novelty 8.0

    MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...

  3. Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems

    cs.AI 2026-05 unverdicted novelty 7.0

    Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.

  4. Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

  5. Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents

    cs.CR 2026-05 unverdicted novelty 7.0

    Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...

  6. When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory

    cs.AI 2026-05 unverdicted novelty 7.0

    A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.

  7. MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents

    cs.MA 2026-05 unverdicted novelty 7.0

    MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.

  8. MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing

    cs.AI 2026-05 unverdicted novelty 7.0

    MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.

  9. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 accept novelty 7.0

    NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...

  10. From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents

    cs.CL 2026-04 unverdicted novelty 7.0

    Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.

  11. Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...

  12. ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents

    cs.AI 2026-04 unverdicted novelty 7.0

    ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

  13. A-MBER: Affective Memory Benchmark for Emotion Recognition

    cs.AI 2026-04 unverdicted novelty 7.0

    A-MBER is a new benchmark for evaluating AI models on using interaction history to recognize and explain a user's present affective state across judgment, retrieval, and explanation tasks.

  14. Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory

    cs.CL 2025-11 unverdicted novelty 7.0

    Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.

  15. MIRIX: Multi-Agent Memory System for LLM-Based Agents

    cs.CL 2025-07 unverdicted novelty 7.0

    MIRIX introduces a modular multi-agent architecture with Core, Episodic, Semantic, Procedural, Resource, and Knowledge Vault memories that outperforms RAG baselines by 35% on ScreenshotVQA and reaches 85.4% on LOCOMO.

  16. SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory

    cs.AI 2026-05 unverdicted novelty 6.0

    SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...

  17. HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution

    cs.AI 2026-05 unverdicted novelty 6.0

    HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.

  18. Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall

    cs.CL 2026-05 conditional novelty 6.0

    True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...

  19. NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles

    cs.AI 2026-05 unverdicted novelty 6.0

    NeuroState-Bench supplies human-calibrated tasks and probes that measure commitment integrity in LLM agents and shows this measure diverges from ordinary task success.

  20. Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture

    cs.SE 2026-05 unverdicted novelty 6.0

    RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...

  21. MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents

    cs.CL 2026-05 unverdicted novelty 6.0

    A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.

  22. From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction

    cs.AI 2026-04 unverdicted novelty 6.0

    Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.

  23. Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...

  24. Stateless Decision Memory for Enterprise AI Agents

    cs.AI 2026-04 unverdicted novelty 6.0

    Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...

  25. LLMs Corrupt Your Documents When You Delegate

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.

  26. Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents

    cs.AI 2026-04 conditional novelty 6.0

    Dual-trace encoding improves LLM agent cross-session recall from 53.5% to 73.7% accuracy by storing facts alongside concrete scene reconstructions, with largest gains in temporal reasoning and multi-session aggregation.

  27. Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards

    cs.AI 2026-04 unverdicted novelty 6.0

    Introduces MemHome benchmark and RL with multi-dimensional rewards for memory-driven smart home device control.

  28. MemReader: From Passive to Active Extraction for Long-Term Agent Memory

    cs.CL 2026-04 unverdicted novelty 6.0

    MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.

  29. FileGram: Grounding Agent Personalization in File-System Behavioral Traces

    cs.CV 2026-04 unverdicted novelty 6.0

    FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.

  30. SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval

    cs.IR 2026-04 conditional novelty 6.0

    SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.

  31. LLM-Oriented Information Retrieval: A Denoising-First Perspective

    cs.IR 2026-05 unverdicted novelty 5.0

    Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...

  32. EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval

    cs.CL 2026-04 unverdicted novelty 5.0

    EngramaBench shows structured graph memory outperforms full-context prompting on cross-space reasoning in long conversations but scores lower overall than full-context and higher than vector retrieval.

  33. EgoSelf: From Memory to Personalized Egocentric Assistant

    cs.CV 2026-04 unverdicted novelty 5.0

    EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.

  34. MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents

    cs.AI 2026-04 unverdicted novelty 5.0

    MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93 percent on LongMemEvalS while using 80 percent fewer tokens...

  35. MemOS: A Memory OS for AI System

    cs.CL 2025-07 unverdicted novelty 5.0

    MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.

  36. Memory as Metabolism: A Design for Companion Knowledge Systems

    cs.AI 2026-04 unverdicted novelty 4.0

    This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...

  37. A Framework for Longitudinal Health AI Agents

    cs.AI 2026-04 unverdicted novelty 4.0

    Proposes a multi-layer framework and agent architecture that operationalizes adaptation, coherence, continuity, and agency for longitudinal health AI agents.

Reference graph

Works this paper leans on

76 extracted references · 76 canonical work pages · cited by 36 Pith papers · 5 internal anchors

  1. [1]

    Scaling Learning Algorithms Towards

    Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards

  2. [2]

    and Osindero, Simon and Teh, Yee Whye , journal =

    Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =

  3. [3]

    2016 , publisher=

    Deep learning , author=. 2016 , publisher=

  4. [5]

    2024 , eprint=

    DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents , author=. 2024 , eprint=

  5. [10]

    Goodman , editor =

    Jesse Mu and Xiang Li and Noah D. Goodman , editor =. Learning to Compress Prompts with Gist Tokens , booktitle =. 2023 , url =

  6. [13]

    In: Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (Dec 2023).https://doi.org/ 10.18653/v1/2023.emnlp-main.391

    Yucheng Li and Bo Dong and Frank Guerin and Chenghua Lin , editor =. Compressing Context to Enhance Inference Efficiency of Large Language Models , booktitle =. 2023 , url =. doi:10.18653/V1/2023.EMNLP-MAIN.391 , timestamp =

  7. [14]

    The Twelfth International Conference on Learning Representations,

    Fangyuan Xu and Weijia Shi and Eunsol Choi , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

  8. [15]

    Forty-first International Conference on Machine Learning,

    Yao Fu and Rameswar Panda and Xinyao Niu and Xiang Yue and Hannaneh Hajishirzi and Yoon Kim and Hao Peng , title =. Forty-first International Conference on Machine Learning,. 2024 , url =

  9. [17]

    Chi and Nathanael Sch

    Freda Shi and Xinyun Chen and Kanishka Misra and Nathan Scales and David Dohan and Ed H. Chi and Nathanael Sch. Large Language Models Can Be Easily Distracted by Irrelevant Context , booktitle =. 2023 , url =

  10. [19]

    Augmenting Language Models with Long-Term Memory , booktitle =

    Weizhi Wang and Li Dong and Hao Cheng and Xiaodong Liu and Xifeng Yan and Jianfeng Gao and Furu Wei , editor =. Augmenting Language Models with Long-Term Memory , booktitle =. 2023 , url =

  11. [22]

    Manning , title =

    Parth Sarthi and Salman Abdullah and Aditi Tuli and Shubh Khanna and Anna Goldie and Christopher D. Manning , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

  12. [23]

    2024 , howpublished =

    Memory and New Controls for ChatGPT , author =. 2024 , howpublished =

  13. [24]

    2024 , howpublished =

    Memory Overview Guide , author =. 2024 , howpublished =

  14. [25]

    2022 , url =

    OpenAI , title =. 2022 , url =

  15. [26]

    2023 , url =

    Microsoft , title =. 2023 , url =

  16. [27]

    2024 , eprint=

    Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots , author=. 2024 , eprint=

  17. [28]

    2023 , howpublished =

    Gregory Kamradt , title =. 2023 , howpublished =

  18. [30]

    2023 , eprint=

    Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=

  19. [34]

    2023 , howpublished =

    Zhang, Dun , title =. 2023 , howpublished =

  20. [38]

    Gautier Izacard and Mathilde Caron and Lucas Hosseini and Sebastian Riedel and Piotr Bojanowski and Armand Joulin and Edouard Grave , title =. Trans. Mach. Learn. Res. , volume =. 2022 , url =

  21. [40]

    Qwen2.5: A Party of Foundation Models , url =

    Qwen Team , month =. Qwen2.5: A Party of Foundation Models , url =

  22. [41]

    2024 , month =

    Mistral NeMo: Our new best small model , journal =. 2024 , month =

  23. [45]

    P er LTQA : A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering

    Du, Yiming and Wang, Hongru and Zhao, Zhengyi and Liang, Bin and Wang, Baojun and Zhong, Wanjun and Wang, Zezhong and Wong, Kam-Fai. P er LTQA : A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering. Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10). 2024

  24. [54]

    Language Model Information Retrieval with Document Expansion

    Tao, Tao and Wang, Xuanhui and Mei, Qiaozhu and Zhai, ChengXiang. Language Model Information Retrieval with Document Expansion. Proceedings of the Human Language Technology Conference of the NAACL , Main Conference. 2006

  25. [58]

    Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S \' e bastien Bubeck, Martin Cai, Caio C \' e sar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dix...

  26. [59]

    Make your LLM fully utilize the context

    Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, and Jian - Guang Lou. Make your LLM fully utilize the context. CoRR, abs/2404.16811, 2024. doi:10.48550/ARXIV.2404.16811. URL https://doi.org/10.48550/arXiv.2404.16811

  27. [60]

    Longformer: The Long-Document Transformer

    Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020

  28. [61]

    M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling

    Pawe Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, I \ n igo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Ga s i \'c . M ulti WOZ - a large-scale multi-domain W izard-of- O z dataset for task-oriented dialogue modelling. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ' ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical ...

  29. [62]

    Walking down the memory maze: Beyond context limit through interactive reading

    Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. Walking down the memory maze: Beyond context limit through interactive reading. CoRR, abs/2310.05029, 2023a. doi:10.48550/ARXIV.2310.05029. URL https://doi.org/10.48550/arXiv.2310.05029

  30. [63]

    Dense X retrieval: What retrieval granularity should we use?

    Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. Dense X retrieval: What retrieval granularity should we use? CoRR, abs/2312.06648, 2023b. doi:10.48550/ARXIV.2312.06648. URL https://doi.org/10.48550/arXiv.2312.06648

  31. [64]

    Adapting language models to compress contexts

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 3829–3846. Association for Computational Linguistic...

  32. [65]

    Memory overview guide

    Coze. Memory overview guide. https://www.coze.com/docs/guides/memory_overview?_lang=en, 2024. Accessed: September 15, 2024

  33. [66]

    Enhancing chat language models by scaling high-quality instructional conversations

    Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023

  34. [67]

    PerLTQA: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering

    Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. PerLTQA: A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, and Runcong Zhao (eds.), Proceedings of the 10t...

  35. [68]

    The Llama 3 Herd of Models

    Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024

  36. [69]

    Improving retrieval of short texts through document expansion

    Miles Efron, Peter Organisciak, and Katrina Fenlon. Improving retrieval of short texts through document expansion. In William R. Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson (eds.), The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR '12, Portland, OR, USA, August 12-16, 2012, pp. 911–920. A...

  37. [70]

    Data engineering for scaling language models to 128k context

    Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=TaAqeo7lUh

  38. [71]

    Hipporag: Neurobiologically inspired long-term memory for large language models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. CoRR, abs/2405.14831, 2024. doi:10.48550/ARXIV.2405.14831. URL https://doi.org/10.48550/arXiv.2405.14831

  39. [72]

    Unsupervised dense information retrieval with contrastive learning

    Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=jKN1pXi7b0

  40. [73]

    Llmlingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023, pp. 13358–13376...

  41. [74]

    Needle in a haystack - pressure testing llms

    Gregory Kamradt. Needle in a haystack - pressure testing llms. GitHub, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack

  42. [75]

    Dialsim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents

    Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, and Edward Choi. Dialsim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents, 2024. URL https://arxiv.org/abs/2406.13144

  43. [76]

    Reformer: The Efficient Transformer

    Nikita Kitaev, Łukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020

  44. [77]

    Hello again! llm-powered personalized agent for long-term dialogue

    Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. Hello again! llm-powered personalized agent for long-term dialogue. CoRR, abs/2406.05925, 2024. doi:10.48550/ARXIV.2406.05925. URL https://doi.org/10.48550/arXiv.2406.05925

  45. [78]

    Lost in the Middle: How Language Models Use Long Contexts

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics, 12:157–173, 2024. doi:10.1162/TACL_A_00638. URL https://doi.org/10.1162/tacl_a_00638

  46. [79]

    G-Eval: NLG Evaluation using GPT-4 with Better Human Alignment

    Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G-Eval: NLG evaluation using GPT-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 2511–2522, Singapore, December 2023. Association for Computational ...

  47. [80]

    Evaluating Very Long-Term Conversational Memory of LLM Agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 13851–13870, Bangkok...

  48. [81]

    Announcing microsoft copilot, your everyday ai companion

    Microsoft. Announcing microsoft copilot, your everyday ai companion, 2023. URL https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/. Accessed: September 15, 2024

  49. [82]

    Mistral nemo: Our new best small model

    Mistral AI Team. Mistral nemo: Our new best small model. Mistral AI, July 2024. URL https://mistral.ai/news/mistral-nemo

  50. [83]

    Jesse Mu, Xiang Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 1...

  51. [84]

    MTEB: Massive text embedding benchmark

    Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014–2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. d...

  52. [85]

    Chatgpt

    OpenAI. Chatgpt, 2022. URL https://chat.openai.com/chat. Accessed: September 15, 2024

  53. [86]

    Memory and new controls for chatgpt

    OpenAI. Memory and new controls for chatgpt. https://openai.com/index/memory-and-new-controls-for-chatgpt/, 2024. Accessed: September 15, 2024

  54. [87]

    Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249–266, 2019. doi:10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016/

  55. [88]

    The probabilistic relevance framework: Bm25 and beyond

    Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333–389, 2009. doi:10.1561/1500000019. URL https://doi.org/10.1561/1500000019

  56. [89]

    Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=GN921JHCRw

  57. [90]

    Large language models can be easily distracted by irrelevant context

    Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 202...

  58. [91]

    REPLUG: retrieval-augmented black-box language models

    Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-augmented black-box language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...

  59. [92]

    Language model information retrieval with document expansion

    Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai. Language model information retrieval with document expansion. In Robert C. Moore, Jeff Bilmes, Jennifer Chu-Carroll, and Mark Sanderson (eds.), Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 407–414, New York City, USA, June 2006. Association for Compu...

  60. [93]

    Qwen2.5: A party of foundation models

    Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/

  61. [94]

    Augmenting language models with long-term memory

    Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Ne...

  62. [95]

    AirDialogue: An environment for goal-oriented dialogue research

    Wei Wei, Quoc Le, Andrew Dai, and Jia Li. AirDialogue: An environment for goal-oriented dialogue research. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3844–3854, Brussels, Belgium, October-November 2018. Association for Comp...

  63. [96]

    Memory Networks

    Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014

  64. [97]

    Memorizing transformers

    Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022

  65. [98]

    Baize: An open-source chat model with parameter-efficient tuning on self-chat data

    Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6268–6278, Singapore, December 2023. Association for Computational Linguist...

  66. [99]

    RECOMP: improving retrieval-augmented lms with context compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: improving retrieval-augmented lms with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=mlJLVigNHp

  67. [100]

    Beyond Goldfish Memory: Long-Term Open-Domain Conversation

    Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5180–5197, Dublin, Ireland, May 2022a. Association for Computationa...

  68. [101]

    Long time no see! open-domain conversation with long-term persona memory

    Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. Long time no see! open-domain conversation with long-term persona memory. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 2639–2650, Dublin, Ireland, May 2022b. Associati...

  69. [102]

    Did you read the instructions? rethinking the effectiveness of task definitions in instruction learning

    Fan Yin, Jesse Vig, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu. Did you read the instructions? rethinking the effectiveness of task definitions in instruction learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...

  70. [103]

    Chain-of-note: Enhancing robustness in retrieval-augmented language models

    Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023

  71. [104]

    STELLA EN 1.5B v5

    Dun Zhang. STELLA EN 1.5B v5. https://huggingface.co/dunzhang/stella_en_1.5B_v5, 2023. Accessed: September 15, 2024

  72. [105]

    Cognitive kernel: An open-source agent system towards generalist autopilots

    Hongming Zhang, Xiaoman Pan, Hongwei Wang, Kaixin Ma, Wenhao Yu, and Dong Yu. Cognitive kernel: An open-source agent system towards generalist autopilots, 2024. URL https://arxiv.org/abs/2409.10277

  73. [106]

    SituatedQA: Incorporating extra-linguistic contexts into QA

    Michael Zhang and Eunsol Choi. SituatedQA: Incorporating extra-linguistic contexts into QA. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7371–7387, Online and Punta Cana, Dominican Republic, November 2021. Association f...

  74. [107]

    Judging llm-as-a-judge with mt-bench and chatbot arena

    Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging llm-as-a-judge with mt-bench and chatbot arena, 2023

  75. [108]

    MemoryBank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 20...

  76. [109]

    Training language models with memory augmentation

    Zexuan Zhong, Tao Lei, and Danqi Chen. Training language models with memory augmentation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5657–5673, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653...