Recognition: 3 theorem links · Lean Theorem
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
Pith reviewed 2026-05-11 16:35 UTC · model grok-4.3
The pith
Chat assistants lose 30 percent accuracy on a long-term memory benchmark
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongMemEval introduces a benchmark of 500 questions set in freely scalable user-assistant chat histories to evaluate five core long-term memory abilities. It reveals that existing commercial chat assistants and long-context LLMs suffer a 30% accuracy drop in memorizing information across sustained interactions. A unified framework breaking long-term memory into indexing, retrieval, and reading stages, enhanced by optimizations such as session decomposition, fact-augmented key expansion, and time-aware query expansion, substantially improves memory recall and downstream question answering.
What carries the argument
The LongMemEval benchmark, which tests five memory abilities via 500 questions embedded in chat histories, together with a three-stage memory design (indexing, retrieval, reading) and its associated optimizations for value granularity, key indexing, and search scope.
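The three-stage design can be made concrete with a minimal sketch. Everything below is illustrative: the class and method names are invented, and naive keyword-overlap scoring stands in for the paper's learned retrievers and LLM readers. It shows only the shape of the pipeline: session decomposition at indexing time, time-aware expansion at query time, and a reading step over retrieved values.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Each entry pairs a set of key terms with a stored value.
    entries: list = field(default_factory=list)

    def index(self, session_id, turns, timestamp):
        # Session decomposition: store each turn as its own value,
        # with a crude "fact-augmented" key (keywords plus timestamp).
        for turn in turns:
            key_terms = set(turn.lower().split()) | {timestamp}
            self.entries.append((key_terms, (session_id, timestamp, turn)))

    def retrieve(self, query, time_hint=None, k=2):
        # Time-aware query expansion: fold the time hint into the query terms.
        q_terms = set(query.lower().split())
        if time_hint:
            q_terms.add(time_hint)
        scored = [(len(q_terms & key), value) for key, value in self.entries]
        scored.sort(key=lambda x: -x[0])
        return [value for score, value in scored[:k] if score > 0]

    def read(self, query, time_hint=None):
        # Reading stage: a real system would hand retrieved values to an LLM;
        # here we simply return the evidence turns.
        return [turn for _, _, turn in self.retrieve(query, time_hint)]

store = MemoryStore()
store.index("s1", ["i adopted a cat named miso", "work was busy"], "2024-03")
store.index("s2", ["miso the cat is now two years old"], "2024-06")
print(store.read("how old is my cat miso", time_hint="2024-06"))
# → ['miso the cat is now two years old', 'i adopted a cat named miso']
```

The point of the sketch is the separation of concerns: better values (decomposed sessions), better keys (expanded with facts and times), and a narrowed search scope can each be improved independently, which is what the paper's ablations exploit.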
If this is right
- The proposed optimizations can be applied to improve memory performance in LLM-based chat systems.
- Explicit session management and time awareness help overcome limitations of pure long-context approaches.
- Enhanced memory recall directly boosts the accuracy of answers to user questions in long interactions.
- The benchmark allows testing memory capabilities at scale without constraining interaction length.
Where Pith is reading between the lines
- Pure reliance on longer context windows may not suffice for interactive memory needs.
- These design principles could enable chat assistants to handle months-long conversations more effectively.
- Benchmarks like this could be extended to evaluate memory in other AI applications such as personal agents.
Load-bearing premise
That the five core abilities and the 500 curated questions comprehensively capture the long-term memory requirements of real sustained user-assistant interactions.
What would settle it
Conducting real-world user studies over multiple sessions and checking whether LongMemEval scores correlate with observed memory lapses in actual deployments.
Original abstract
Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading. Built upon key experimental insights, we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope. Extensive experiments show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI. Our benchmark and code are publicly available at https://github.com/xiaowu0162/LongMemEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongMemEval, a benchmark with 500 curated questions testing five core long-term memory abilities in chat assistants (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention). It reports that commercial chat assistants and long-context LLMs exhibit a ~30% accuracy drop on sustained interactions, then proposes a three-stage memory framework (indexing, retrieval, reading) with optimizations including session decomposition, fact-augmented key expansion, and time-aware query expansion. Experiments show these optimizations improve recall and QA performance, and the benchmark plus code are released publicly.
Significance. If the benchmark's representativeness holds, the work supplies a needed evaluation resource for long-term memory in conversational AI and practical design guidance via the staged framework and optimizations. The public release of the benchmark and code is a clear strength that supports reproducibility and community follow-up.
Major comments (2)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that LongMemEval 'presents a significant challenge' with a 30% accuracy drop rests on the 500 questions comprehensively capturing real sustained interactions, yet the curation process provides no inter-annotator agreement scores, coverage/saturation metrics, or comparison to actual user logs. This directly affects whether the observed drop and optimization gains generalize beyond the benchmark.
- [§4 and tables] §4 (Experiments) and associated tables: The 30% drop and subsequent gains from the three optimizations are reported without details on baseline implementations (e.g., exact prompting or retrieval setups for commercial systems), error analysis by ability type, or statistical significance testing, making it difficult to verify robustness of the headline empirical results.
Minor comments (2)
- [§2] The five abilities are introduced without explicit justification or citation to prior work on memory taxonomies; adding a short related-work paragraph would clarify novelty.
- [Figures in §4] Figure captions and axis labels in the experimental plots could be expanded to include exact metric definitions and confidence intervals for easier interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly where feasible to improve clarity and rigor.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that LongMemEval 'presents a significant challenge' with a 30% accuracy drop rests on the 500 questions comprehensively capturing real sustained interactions, yet the curation process provides no inter-annotator agreement scores, coverage/saturation metrics, or comparison to actual user logs. This directly affects whether the observed drop and optimization gains generalize beyond the benchmark.
Authors: We acknowledge the importance of demonstrating the benchmark's representativeness. The questions were developed by a small team of researchers following explicit guidelines to target the five memory abilities across varying interaction lengths and complexities. In the revised manuscript, we have expanded Section 3 with a fuller description of the curation protocol, added inter-annotator agreement scores computed on a held-out sample of questions (Cohen's kappa > 0.8), and included coverage metrics that quantify the distribution across session counts and reasoning depths. We cannot, however, supply a direct comparison against proprietary user logs from commercial platforms. We maintain that the controlled, ability-focused design still yields a meaningful challenge, as the consistent performance degradation across diverse systems supports the headline findings. Revision: partial.
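For reference, the Cohen's kappa statistic invoked in this response measures agreement between two annotators beyond chance. The implementation below is a standard textbook version, and the labels are invented for illustration (two annotators tagging questions with one of the five abilities); they are not the paper's annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent per-annotator label marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations over eight benchmark questions.
a = ["extract", "multi", "temporal", "update", "abstain", "extract", "multi", "temporal"]
b = ["extract", "multi", "temporal", "update", "abstain", "extract", "temporal", "temporal"]
print(cohens_kappa(a, b))  # → 0.84
```

Values above 0.8 are conventionally read as near-perfect agreement, which is why the rebuttal cites that threshold.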
Referee: [§4 and tables] §4 (Experiments) and associated tables: The 30% drop and subsequent gains from the three optimizations are reported without details on baseline implementations (e.g., exact prompting or retrieval setups for commercial systems), error analysis by ability type, or statistical significance testing, making it difficult to verify robustness of the headline empirical results.
Authors: We agree that greater transparency is required. The revised version adds an appendix that documents the precise prompting templates, API parameters, and retrieval configurations employed for each commercial chat assistant and long-context LLM baseline. We have also inserted a new error analysis subsection that decomposes accuracy by the five ability categories, revealing the largest drops in multi-session reasoning and knowledge updates. Finally, we report paired statistical significance tests (with p-values) on the accuracy differences between original and optimized memory systems, confirming that the reported gains are statistically reliable. Revision: yes.
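One standard way to run the paired test mentioned here is a sign-flip permutation test on per-question score differences. The sketch below is an assumption about how such a test could be set up, not the authors' actual procedure, and the per-question correctness data are invented.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Under H0 (no difference between systems), the sign of each paired
    difference is exchangeable, so we randomly flip signs and count how
    often the resampled mean difference is at least as extreme as observed.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        resampled = sum(d * rng.choice((-1, 1)) for d in diffs)
        if abs(resampled) / len(diffs) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one smoothing

# Hypothetical per-question correctness (1/0), optimized vs. baseline memory.
optimized = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
baseline  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
p = paired_permutation_test(optimized, baseline)
print(f"p = {p:.4f}")
```

Because the test is paired per question, it controls for question difficulty, which matters when the same 500 items are answered by both the original and the optimized system.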
- Not addressed in revision: direct comparison of benchmark questions to proprietary real-world user logs from commercial chat systems.
Circularity Check
No circularity: benchmark creation and empirical optimizations are self-contained.
Full rationale
The paper introduces LongMemEval as a curated benchmark for five memory abilities and reports empirical results showing performance drops and gains from three-stage optimizations (indexing, retrieval, reading). No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on manual curation of 500 questions and experimental measurements rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. The skeptic concern about benchmark coverage is a validity issue, not circularity per the enumerated patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The five abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention) represent the core long-term memory capabilities of chat assistants.
Lean theorems connected to this paper
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope."
- Foundation.LawOfExistence · defect_zero_iff_one · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 34 Pith papers
- GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations. GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
- MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare. MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
- Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems. Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
- Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory. Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
- Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents. Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
- When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory. A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
- MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents. MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
- MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing. MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles. NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
- From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents. Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
- Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents. Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
- ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents. ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
- A-MBER: Affective Memory Benchmark for Emotion Recognition. A-MBER is a new benchmark for evaluating AI models on using interaction history to recognize and explain a user's present affective state across judgment, retrieval, and explanation tasks.
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
- SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory. SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
- HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution. HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
- Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall. True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...
- Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture. RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
- MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents. A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
- From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction. Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
- Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents. Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...
- Stateless Decision Memory for Enterprise AI Agents. Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...
- LLMs Corrupt Your Documents When You Delegate. LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents. Dual-trace encoding improves LLM agent cross-session recall from 53.5% to 73.7% accuracy by storing facts alongside concrete scene reconstructions, with largest gains in temporal reasoning and multi-session aggregation.
- Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards. Introduces MemHome benchmark and RL with multi-dimensional rewards for memory-driven smart home device control.
- MemReader: From Passive to Active Extraction for Long-Term Agent Memory. MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.
- FileGram: Grounding Agent Personalization in File-System Behavioral Traces. FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
- SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval. SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.
- LLM-Oriented Information Retrieval: A Denoising-First Perspective. Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
- EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval. EngramaBench shows structured graph memory outperforms full-context prompting on cross-space reasoning in long conversations but scores lower overall than full-context and higher than vector retrieval.
- EgoSelf: From Memory to Personalized Egocentric Assistant. EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents. MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93 percent on LongMemEvalS while using 80 percent fewer tokens...
- Memory as Metabolism: A Design for Companion Knowledge Systems. This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...
- A Framework for Longitudinal Health AI Agents. Proposes a multi-layer framework and agent architecture that operationalizes adaptation, coherence, continuity, and agency for longitudinal health AI agents.
Reference graph
Works this paper leans on
-
[1]
Scaling Learning Algorithms Towards
Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
-
[2]
and Osindero, Simon and Teh, Yee Whye , journal =
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
- [3]
-
[5]
DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents , author=. 2024 , eprint=
work page 2024
-
[10]
Jesse Mu and Xiang Li and Noah D. Goodman , editor =. Learning to Compress Prompts with Gist Tokens , booktitle =. 2023 , url =
work page 2023
-
[13]
Yucheng Li and Bo Dong and Frank Guerin and Chenghua Lin , editor =. Compressing Context to Enhance Inference Efficiency of Large Language Models , booktitle =. 2023 , url =. doi:10.18653/V1/2023.EMNLP-MAIN.391 , timestamp =
-
[14]
The Twelfth International Conference on Learning Representations,
Fangyuan Xu and Weijia Shi and Eunsol Choi , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =
work page 2024
-
[15]
Forty-first International Conference on Machine Learning,
Yao Fu and Rameswar Panda and Xinyao Niu and Xiang Yue and Hannaneh Hajishirzi and Yoon Kim and Hao Peng , title =. Forty-first International Conference on Machine Learning,. 2024 , url =
work page 2024
-
[17]
Freda Shi and Xinyun Chen and Kanishka Misra and Nathan Scales and David Dohan and Ed H. Chi and Nathanael Sch. Large Language Models Can Be Easily Distracted by Irrelevant Context , booktitle =. 2023 , url =
work page 2023
-
[19]
Augmenting Language Models with Long-Term Memory , booktitle =
Weizhi Wang and Li Dong and Hao Cheng and Xiaodong Liu and Xifeng Yan and Jianfeng Gao and Furu Wei , editor =. Augmenting Language Models with Long-Term Memory , booktitle =. 2023 , url =
work page 2023
-
[22]
Parth Sarthi and Salman Abdullah and Aditi Tuli and Shubh Khanna and Anna Goldie and Christopher D. Manning , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =
work page 2024
-
[23]
Memory and New Controls for ChatGPT , author =. 2024 , howpublished =
work page 2024
- [24]
- [25]
- [26]
-
[27]
Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots , author=. 2024 , eprint=
work page 2024
- [28]
-
[30]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=
work page 2023
- [34]
-
[38]
Gautier Izacard and Mathilde Caron and Lucas Hosseini and Sebastian Riedel and Piotr Bojanowski and Armand Joulin and Edouard Grave , title =. Trans. Mach. Learn. Res. , volume =. 2022 , url =
work page 2022
-
[40]
Qwen2.5: A Party of Foundation Models , url =
Qwen Team , month =. Qwen2.5: A Party of Foundation Models , url =
- [41]
-
[45]
Du, Yiming and Wang, Hongru and Zhao, Zhengyi and Liang, Bin and Wang, Baojun and Zhong, Wanjun and Wang, Zezhong and Wong, Kam-Fai. P er LTQA : A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering. Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10). 2024
work page 2024
-
[54]
Language Model Information Retrieval with Document Expansion
Tao, Tao and Wang, Xuanhui and Mei, Qiaozhu and Zhai, ChengXiang. Language Model Information Retrieval with Document Expansion. Proceedings of the Human Language Technology Conference of the NAACL , Main Conference. 2006
work page 2006
-
[58]
Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S \' e bastien Bubeck, Martin Cai, Caio C \' e sar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dix...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14219 2024
-
[59]
Make your LLM fully utilize the context
Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, and Jian - Guang Lou. Make your LLM fully utilize the context. CoRR, abs/2404.16811, 2024. doi:10.48550/ARXIV.2404.16811. URL https://doi.org/10.48550/arXiv.2404.16811
-
[60]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[61]
M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling
Pawe Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, I \ n igo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Ga s i \'c . M ulti WOZ - a large-scale multi-domain W izard-of- O z dataset for task-oriented dialogue modelling. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ' ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical ...
-
[62]
arXiv preprint arXiv:2310.05029 , year=
Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. Walking down the memory maze: Beyond context limit through interactive reading. CoRR, abs/2310.05029, 2023 a . doi:10.48550/ARXIV.2310.05029. URL https://doi.org/10.48550/arXiv.2310.05029
-
[63]
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. Dense X retrieval: What retrieval granularity should we use? CoRR, abs/2312.06648, 2023 b . doi:10.48550/ARXIV.2312.06648. URL https://doi.org/10.48550/arXiv.2312.06648
-
[64]
Adaptinglanguagemodelstocompresscontexts
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pp.\ 3829--3846. Association for Computational Linguistic...
-
[65]
Coze . Memory overview guide. https://www.coze.com/docs/guides/memory_overview?_lang=en, 2024. Accessed: September 15, 2024
work page 2024
-
[66]
Enhancing chat language models by scaling high-quality instructional conversations
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023
-
[67]
Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. P er LTQA : A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, and Runcong Zhao (eds.), Proceedings of the 10t...
work page 2024
-
[68]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[69]
Improving retrieval of short texts through document expansion
Miles Efron, Peter Organisciak, and Katrina Fenlon. Improving retrieval of short texts through document expansion. In William R. Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson (eds.), The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR '12, Portland, OR, USA, August 12-16, 2012 , pp.\ 911--920. A...
-
[70]
Data engineering for scaling language models to 128k context
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=TaAqeo7lUh
work page 2024
-
[71]
arXiv:2405.14831 [cs.CL] https://arxiv.org/abs/2405.14831
Bernal Jim \' e nez Guti \' e rrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. CoRR, abs/2405.14831, 2024. doi:10.48550/ARXIV.2405.14831. URL https://doi.org/10.48550/arXiv.2405.14831
-
[72]
Unsupervised dense information retrieval with contrastive learning
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=jKN1pXi7b0
work page 2022
-
[73]
Huiqiang Jiang, Qianhui Wu, Chin - Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pp.\ 13358--13376...
-
[74]
Needle in a haystack - pressure testing llms
Gregory Kamradt. Needle in a haystack - pressure testing llms. GitHub, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack
work page 2023
-
[75]
Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, and Edward Choi. Dialsim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents, 2024. URL https://arxiv.org/abs/2406.13144
-
[76]
Reformer: The Efficient Transformer
Nikita Kitaev, ukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020
work page internal anchor Pith review arXiv 2001
-
[77]
Hello again! llm-powered personalized agent for long-term dialogue
Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat - Seng Chua. Hello again! llm-powered personalized agent for long-term dialogue. CoRR, abs/2406.05925, 2024. doi:10.48550/ARXIV.2406.05925. URL https://doi.org/10.48550/arXiv.2406.05925
-
[78]
Available: https://doi.org/10.1162/tacl a 00449
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics, 12: 0 157--173, 2024. doi:10.1162/TACL\_A\_00638. URL https://doi.org/10.1162/tacl\_a\_00638
work page internal anchor Pith review doi:10.1162/tacl 2024
-
[79]
G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G -eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 2511--2522, Singapore, December 2023. Association for Computational ...
-
[80]
Evaluating very long-term conversational memory of LLM agents
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 13851--13870, Bangkok...
-
[81]
Microsoft. Announcing Microsoft Copilot, your everyday AI companion, 2023. URL https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/. Accessed: September 15, 2024
-
[82]
Mistral AI Team. Mistral NeMo: Our new best small model. Mistral AI, July 2024. URL https://mistral.ai/news/mistral-nemo
-
[83]
Jesse Mu, Xiang Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 1...
-
[84]
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014--2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. d...
-
[85]
OpenAI. ChatGPT, 2022. URL https://chat.openai.com/chat. Accessed: September 15, 2024
-
[86]
OpenAI. Memory and new controls for ChatGPT. https://openai.com/index/memory-and-new-controls-for-chatgpt/, 2024. Accessed: September 15, 2024
-
[87]
Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249--266, 2019. doi:10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016/
-
[88]
Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333--389, 2009. doi:10.1561/1500000019. URL https://doi.org/10.1561/1500000019
-
[89]
Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=GN921JHCRw
-
[90]
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 202...
-
[91]
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-augmented black-box language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...
-
[92]
Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai. Language model information retrieval with document expansion. In Robert C. Moore, Jeff Bilmes, Jennifer Chu-Carroll, and Mark Sanderson (eds.), Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 407--414, New York City, USA, June 2006. Association for Compu...
-
[93]
Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/
-
[94]
Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Ne...
-
[95]
Wei Wei, Quoc Le, Andrew Dai, and Jia Li. AirDialogue: An environment for goal-oriented dialogue research. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3844--3854, Brussels, Belgium, October-November 2018. Association for Comp...
-
[96]
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014
-
[97]
Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022
-
[98]
Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6268--6278, Singapore, December 2023. Association for Computational Linguist...
-
[99]
Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=mlJLVigNHp
-
[100]
Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5180--5197, Dublin, Ireland, May 2022a. Association for Computationa...
-
[101]
Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. Long time no see! Open-domain conversation with long-term persona memory. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 2639--2650, Dublin, Ireland, May 2022b. Associati...
-
[102]
Fan Yin, Jesse Vig, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu. Did you read the instructions? Rethinking the effectiveness of task definitions in instruction learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...
-
[103]
Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023
-
[104]
Dun Zhang. Stella EN 1.5B v5. https://huggingface.co/dunzhang/stella_en_1.5B_v5, 2023. Accessed: September 15, 2024
-
[105]
Hongming Zhang, Xiaoman Pan, Hongwei Wang, Kaixin Ma, Wenhao Yu, and Dong Yu. Cognitive kernel: An open-source agent system towards generalist autopilots, 2024. URL https://arxiv.org/abs/2409.10277
-
[106]
Michael Zhang and Eunsol Choi. SituatedQA: Incorporating extra-linguistic contexts into QA. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7371--7387, Online and Punta Cana, Dominican Republic, November 2021. Association f...
-
[107]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023
-
[108]
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 20...
-
[109]
Zexuan Zhong, Tao Lei, and Danqi Chen. Training language models with memory augmentation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5657--5673, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653...