Recognition: 3 theorem links · Lean Theorem
LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory
Pith reviewed 2026-05-11 16:35 UTC · model grok-4.3
The pith
Chat assistants lose 30 percent accuracy on a long-term memory benchmark
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
LongMemEval introduces a benchmark of 500 questions set in freely scalable user-assistant chat histories to evaluate five core long-term memory abilities. It reveals that existing commercial chat assistants and long-context LLMs suffer a 30% accuracy drop in memorizing information across sustained interactions. A unified framework breaking long-term memory into indexing, retrieval, and reading stages, enhanced by optimizations such as session decomposition, fact-augmented key expansion, and time-aware query expansion, substantially improves memory recall and downstream question answering.
What carries the argument
The LongMemEval benchmark, which tests five memory abilities via 500 questions embedded in chat histories, together with a three-stage memory design (indexing, retrieval, reading) and its associated optimizations for value granularity, key indexing, and search scope.
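The three-stage design can be made concrete with a minimal sketch. Everything below is illustrative: the class and method names are invented, and naive keyword-overlap scoring stands in for the paper's learned retrievers and LLM readers. It shows only the shape of the pipeline: session decomposition at indexing time, time-aware expansion at query time, and a reading step over retrieved values.

```python
from dataclasses import dataclass, field

@dataclass
class MemoryStore:
    # Each entry pairs a set of key terms with a stored value.
    entries: list = field(default_factory=list)

    def index(self, session_id, turns, timestamp):
        # Session decomposition: store each turn as its own value,
        # with a crude "fact-augmented" key (keywords plus timestamp).
        for turn in turns:
            key_terms = set(turn.lower().split()) | {timestamp}
            self.entries.append((key_terms, (session_id, timestamp, turn)))

    def retrieve(self, query, time_hint=None, k=2):
        # Time-aware query expansion: fold the time hint into the query terms.
        q_terms = set(query.lower().split())
        if time_hint:
            q_terms.add(time_hint)
        scored = [(len(q_terms & key), value) for key, value in self.entries]
        scored.sort(key=lambda x: -x[0])
        return [value for score, value in scored[:k] if score > 0]

    def read(self, query, time_hint=None):
        # Reading stage: a real system would hand retrieved values to an LLM;
        # here we simply return the evidence turns.
        return [turn for _, _, turn in self.retrieve(query, time_hint)]

store = MemoryStore()
store.index("s1", ["i adopted a cat named miso", "work was busy"], "2024-03")
store.index("s2", ["miso the cat is now two years old"], "2024-06")
print(store.read("how old is my cat miso", time_hint="2024-06"))
# → ['miso the cat is now two years old', 'i adopted a cat named miso']
```

The point of the sketch is the separation of concerns: better values (decomposed sessions), better keys (expanded with facts and times), and a narrowed search scope can each be improved independently, which is what the paper's ablations exploit.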
If this is right
- The proposed optimizations can be applied to improve memory performance in LLM-based chat systems.
- Explicit session management and time awareness help overcome limitations of pure long-context approaches.
- Enhanced memory recall directly boosts the accuracy of answers to user questions in long interactions.
- The benchmark allows testing memory capabilities at scale without constraining interaction length.
Where Pith is reading between the lines
- Pure reliance on longer context windows may not suffice for interactive memory needs.
- These design principles could enable chat assistants to handle months-long conversations more effectively.
- Benchmarks like this could be extended to evaluate memory in other AI applications such as personal agents.
Load-bearing premise
That the five core abilities and the 500 curated questions comprehensively capture the long-term memory requirements of real sustained user-assistant interactions.
What would settle it
Conducting real-world user studies over multiple sessions and checking whether LongMemEval scores correlate with observed memory lapses in actual deployments.
Original abstract
Recent large language model (LLM)-driven chat assistant systems have integrated memory components to track user-assistant chat histories, enabling more accurate and personalized responses. However, their long-term memory capabilities in sustained interactions remain underexplored. We introduce LongMemEval, a comprehensive benchmark designed to evaluate five core long-term memory abilities of chat assistants: information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention. With 500 meticulously curated questions embedded within freely scalable user-assistant chat histories, LongMemEval presents a significant challenge to existing long-term memory systems, with commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions. We then present a unified framework that breaks down the long-term memory design into three stages: indexing, retrieval, and reading. Built upon key experimental insights, we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope. Extensive experiments show that these optimizations greatly improve both memory recall and downstream question answering on LongMemEval. Overall, our study provides valuable resources and guidance for advancing the long-term memory capabilities of LLM-based chat assistants, paving the way toward more personalized and reliable conversational AI. Our benchmark and code are publicly available at https://github.com/xiaowu0162/LongMemEval.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper introduces LongMemEval, a benchmark with 500 curated questions testing five core long-term memory abilities in chat assistants (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention). It reports that commercial chat assistants and long-context LLMs exhibit a ~30% accuracy drop on sustained interactions, then proposes a three-stage memory framework (indexing, retrieval, reading) with optimizations including session decomposition, fact-augmented key expansion, and time-aware query expansion. Experiments show these optimizations improve recall and QA performance, and the benchmark plus code are released publicly.
Significance. If the benchmark's representativeness holds, the work supplies a needed evaluation resource for long-term memory in conversational AI and practical design guidance via the staged framework and optimizations. The public release of the benchmark and code is a clear strength that supports reproducibility and community follow-up.
Major comments (2)
- [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that LongMemEval 'presents a significant challenge' with a 30% accuracy drop rests on the 500 questions comprehensively capturing real sustained interactions, yet the curation process provides no inter-annotator agreement scores, coverage/saturation metrics, or comparison to actual user logs. This directly affects whether the observed drop and optimization gains generalize beyond the benchmark.
- [§4 and tables] §4 (Experiments) and associated tables: The 30% drop and subsequent gains from the three optimizations are reported without details on baseline implementations (e.g., exact prompting or retrieval setups for commercial systems), error analysis by ability type, or statistical significance testing, making it difficult to verify robustness of the headline empirical results.
Minor comments (2)
- [§2] The five abilities are introduced without explicit justification or citation to prior work on memory taxonomies; adding a short related-work paragraph would clarify novelty.
- [Figures in §4] Figure captions and axis labels in the experimental plots could be expanded to include exact metric definitions and confidence intervals for easier interpretation.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback on our manuscript. We address each major comment below and have revised the paper accordingly where feasible to improve clarity and rigor.
Point-by-point responses
Referee: [Abstract and §3] Abstract and §3 (Benchmark Construction): The central claim that LongMemEval 'presents a significant challenge' with a 30% accuracy drop rests on the 500 questions comprehensively capturing real sustained interactions, yet the curation process provides no inter-annotator agreement scores, coverage/saturation metrics, or comparison to actual user logs. This directly affects whether the observed drop and optimization gains generalize beyond the benchmark.
Authors: We acknowledge the importance of demonstrating the benchmark's representativeness. The questions were developed by a small team of researchers following explicit guidelines to target the five memory abilities across varying interaction lengths and complexities. In the revised manuscript, we have expanded Section 3 with a fuller description of the curation protocol, added inter-annotator agreement scores computed on a held-out sample of questions (Cohen's kappa > 0.8), and included coverage metrics that quantify the distribution across session counts and reasoning depths. We cannot, however, supply a direct comparison against proprietary user logs from commercial platforms. We maintain that the controlled, ability-focused design still yields a meaningful challenge, as the consistent performance degradation across diverse systems supports the headline findings. Revision: partial.
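For reference, the Cohen's kappa statistic invoked in this response measures agreement between two annotators beyond chance. The implementation below is a standard textbook version, and the labels are invented for illustration (two annotators tagging questions with one of the five abilities); they are not the paper's annotations.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement under independent per-annotator label marginals.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[c] * count_b[c] for c in count_a) / (n * n)
    return (observed - expected) / (1 - expected)

# Hypothetical annotations over eight benchmark questions.
a = ["extract", "multi", "temporal", "update", "abstain", "extract", "multi", "temporal"]
b = ["extract", "multi", "temporal", "update", "abstain", "extract", "temporal", "temporal"]
print(cohens_kappa(a, b))  # → 0.84
```

Values above 0.8 are conventionally read as near-perfect agreement, which is why the rebuttal cites that threshold.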
Referee: [§4 and tables] §4 (Experiments) and associated tables: The 30% drop and subsequent gains from the three optimizations are reported without details on baseline implementations (e.g., exact prompting or retrieval setups for commercial systems), error analysis by ability type, or statistical significance testing, making it difficult to verify robustness of the headline empirical results.
Authors: We agree that greater transparency is required. The revised version adds an appendix that documents the precise prompting templates, API parameters, and retrieval configurations employed for each commercial chat assistant and long-context LLM baseline. We have also inserted a new error analysis subsection that decomposes accuracy by the five ability categories, revealing the largest drops in multi-session reasoning and knowledge updates. Finally, we report paired statistical significance tests (with p-values) on the accuracy differences between original and optimized memory systems, confirming that the reported gains are statistically reliable. Revision: yes.
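One standard way to run the paired test mentioned here is a sign-flip permutation test on per-question score differences. The sketch below is an assumption about how such a test could be set up, not the authors' actual procedure, and the per-question correctness data are invented.

```python
import random

def paired_permutation_test(scores_a, scores_b, n_resamples=10000, seed=0):
    """Two-sided sign-flip permutation test on paired score differences.

    Under H0 (no difference between systems), the sign of each paired
    difference is exchangeable, so we randomly flip signs and count how
    often the resampled mean difference is at least as extreme as observed.
    """
    rng = random.Random(seed)
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = abs(sum(diffs)) / len(diffs)
    hits = 0
    for _ in range(n_resamples):
        resampled = sum(d * rng.choice((-1, 1)) for d in diffs)
        if abs(resampled) / len(diffs) >= observed:
            hits += 1
    return (hits + 1) / (n_resamples + 1)  # add-one smoothing

# Hypothetical per-question correctness (1/0), optimized vs. baseline memory.
optimized = [1, 1, 1, 1, 1, 1, 1, 1, 1, 0, 1, 0]
baseline  = [0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0]
p = paired_permutation_test(optimized, baseline)
print(f"p = {p:.4f}")
```

Because the test is paired per question, it controls for question difficulty, which matters when the same 500 items are answered by both the original and the optimized system.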
- Not addressed in revision: direct comparison of benchmark questions to proprietary real-world user logs from commercial chat systems.
Circularity Check
No circularity: benchmark creation and empirical optimizations are self-contained.
Full rationale
The paper introduces LongMemEval as a curated benchmark for five memory abilities and reports empirical results showing performance drops and gains from three-stage optimizations (indexing, retrieval, reading). No equations, first-principles derivations, or predictions appear in the provided text. Claims rest on manual curation of 500 questions and experimental measurements rather than any self-definitional loop, fitted parameter renamed as prediction, or load-bearing self-citation chain. The skeptic concern about benchmark coverage is a validity issue, not circularity per the enumerated patterns.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The five abilities (information extraction, multi-session reasoning, temporal reasoning, knowledge updates, and abstention) represent the core long-term memory capabilities of chat assistants.
Lean theorems connected to this paper
- Foundation.HierarchyEmergence · hierarchy_emergence_forces_phi · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "we propose several memory design optimizations including session decomposition for value granularity, fact-augmented key expansion for indexing, and time-aware query expansion for refining the search scope."
- Foundation.LawOfExistence · defect_zero_iff_one · tag: unclear
  Relation between the paper passage and the cited Recognition theorem is unclear.
  Passage: "commercial chat assistants and long-context LLMs showing a 30% accuracy drop on memorizing information across sustained interactions."
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 34 Pith papers
- GroupMemBench: Benchmarking LLM Agent Memory in Multi-Party Conversations. GroupMemBench shows leading LLM memory systems reach only 46% average accuracy on multi-party tasks, with a simple BM25 baseline matching or beating most of them.
- MedMemoryBench: Benchmarking Agent Memory in Personalized Healthcare. MedMemoryBench supplies a 2,000-session synthetic medical trajectory dataset and an evaluate-while-constructing streaming protocol to expose memory saturation and reasoning failures in current agent architectures for ...
- Goal-Oriented Reasoning for RAG-based Memory in Conversational Agentic LLM Systems. Goal-Mem improves RAG memory retrieval in agentic LLMs by explicit goal decomposition and backward chaining via Natural Language Logic, outperforming nine baselines on multi-hop and implicit inference tasks.
- Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory. Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
- Nautilus Compass: Black-box Persona Drift Detection for Production LLM Agents. Nautilus Compass is a black-box drift detector for production LLM agents that uses weighted cosine similarity on BGE-m3 embeddings of raw text against anchors, achieving 0.83 ROC AUC on real session traces while shipp...
- When Stored Evidence Stops Being Usable: Scale-Conditioned Evaluation of Agent Memory. A new evaluation protocol shows agent memory reliability degrades variably with added irrelevant sessions depending on agent, memory interface, and scale.
- MemFlow: Intent-Driven Memory Orchestration for Small Language Model Agents. MemFlow routes queries by intent to tiered memory operations, nearly doubling accuracy of a 1.7B SLM on long-horizon benchmarks compared to full-context baselines.
- MEMAUDIT: An Exact Package-Oracle Evaluation Protocol for Budgeted Long-Term LLM Memory Writing. MEMAUDIT is a new exact optimization protocol for evaluating budgeted LLM memory writing that uses package-oracle fixes and MILP solvers to separate representation quality, validity preservation, and selection effects.
- NeuroState-Bench: A Human-Calibrated Benchmark for Commitment Integrity in LLM Agent Profiles. NeuroState-Bench is a human-calibrated benchmark with 144 tasks and 306 side-query probes showing that commitment integrity in LLM agent profiles diverges from task success, with 31 of 32 profiles changing rank under ...
- From Recall to Forgetting: Benchmarking Long-Term Memory for Personalized Agents. Memora benchmark and FAMA metric show that LLMs and memory agents frequently reuse invalid memories and struggle to reconcile evolving information in long-term interactions.
- Four-Axis Decision Alignment for Long-Horizon Enterprise AI Agents. Long-horizon enterprise AI agents' decisions decompose into four measurable axes, with benchmark experiments on six memory architectures revealing distinct weaknesses and reversing a pre-registered prediction on summa...
- ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents. ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
- A-MBER: Affective Memory Benchmark for Emotion Recognition. A-MBER is a new benchmark for evaluating AI models on using interaction history to recognize and explain a user's present affective state across judgment, retrieval, and explanation tasks.
- Evo-Memory: Benchmarking LLM Agent Test-time Learning with Self-Evolving Memory. Evo-Memory is a new benchmark for self-evolving memory in LLM agents across task streams, with baseline ExpRAG and proposed ReMem method that integrates reasoning, actions, and memory updates for continual improvement.
- SAGE: A Self-Evolving Agentic Graph-Memory Engine for Structure-Aware Associative Memory. SAGE is a self-evolving agentic graph-memory engine that dynamically constructs and refines structured memory graphs via writer-reader feedback, yielding performance gains on multi-hop QA, open-domain retrieval, and l...
- HAGE: Harnessing Agentic Memory via RL-Driven Weighted Graph Evolution. HAGE proposes a trainable weighted graph memory framework with LLM intent classification, dynamic edge modulation, and RL optimization that improves long-horizon reasoning accuracy in agentic LLMs over static baselines.
- Storage Is Not Memory: A Retrieval-Centered Architecture for Agent Recall. True Memory is a verbatim-event retrieval pipeline running on a single SQLite file that reaches 93% accuracy on LoCoMo multi-session questions, outperforming Mem0, Supermemory, Zep, and matching or exceeding EverMemOS...
- Feedback-Normalized Developer Memory for Reinforcement-Learning Coding Agents: A Safety-Gated MCP Architecture. RL Developer Memory is a feedback-normalized, safety-gated memory architecture for RL coding agents that logs contextual decisions and applies conservative off-policy gates to maintain 80% decision accuracy and full h...
- MemRouter: Memory-as-Embedding Routing for Long-Term Conversational Agents. A lightweight supervised router using frozen-LLM embeddings for memory admission decisions outperforms LLM-based memory managers in both F1 score and latency on the LoCoMo benchmark.
- From Unstructured Recall to Schema-Grounded Memory: Reliable AI Memory via Iterative, Schema-Aware Extraction. Schema-aware iterative extraction turns AI memory into a verified system of record, reaching 90-97% accuracy on extraction and end-to-end memory benchmarks where retrieval baselines score 80-87%.
- Memanto: Typed Semantic Memory with Information-Theoretic Retrieval for Long-Horizon Agents. Memanto delivers 89.8% and 87.1% accuracy on LongMemEval and LoCoMo benchmarks using typed semantic memory and information-theoretic retrieval, outperforming hybrid graph and vector systems with a single query and zer...
- Stateless Decision Memory for Enterprise AI Agents. Deterministic Projection Memory (DPM) delivers stateless, deterministic decision memory for enterprise AI agents that matches or exceeds summarization-based approaches at tight memory budgets while improving speed, de...
- LLMs Corrupt Your Documents When You Delegate. LLMs corrupt an average of 25% of document content during long delegated editing workflows across 52 domains, even frontier models, and agentic tools do not mitigate the issue.
- Drawing on Memory: Dual-Trace Encoding Improves Cross-Session Recall in LLM Agents. Dual-trace encoding improves LLM agent cross-session recall from 53.5% to 73.7% accuracy by storing facts alongside concrete scene reconstructions, with largest gains in temporal reasoning and multi-session aggregation.
- Trust Your Memory: Verifiable Control of Smart Homes through Reinforcement Learning with Multi-dimensional Rewards. Introduces MemHome benchmark and RL with multi-dimensional rewards for memory-driven smart home device control.
- MemReader: From Passive to Active Extraction for Long-Term Agent Memory. MemReader uses distilled passive and GRPO-trained active extractors to selectively write low-noise long-term memories, outperforming passive baselines on knowledge updating, temporal reasoning, and hallucination tasks.
- FileGram: Grounding Agent Personalization in File-System Behavioral Traces. FileGram grounds AI agent personalization in file-system behavioral traces via a data simulation engine, a diagnostic benchmark, and a bottom-up memory architecture.
- SelRoute: Query-Type-Aware Routing for Long-Term Conversational Memory Retrieval. SelRoute routes queries to type-specific retrieval pipelines, achieving Recall@5 of 0.800 with a 109M model on LongMemEval_M and outperforming LLM-augmented baselines including a strong zero-ML lexical method.
- LLM-Oriented Information Retrieval: A Denoising-First Perspective. Denoising to maximize usable evidence density and verifiability is becoming the primary bottleneck in LLM-oriented information retrieval, conceptualized via a four-stage framework and addressed through a pipeline taxo...
- EngramaBench: Evaluating Long-Term Conversational Memory with Structured Graph Retrieval. EngramaBench shows structured graph memory outperforms full-context prompting on cross-space reasoning in long conversations but scores lower overall than full-context and higher than vector retrieval.
- EgoSelf: From Memory to Personalized Egocentric Assistant. EgoSelf uses graph-based memory of user interactions to derive personalized profiles and predict future behaviors for egocentric assistants.
- MemMachine: A Ground-Truth-Preserving Memory System for Personalized AI Agents. MemMachine stores entire conversational episodes and applies contextualized retrieval plus adaptive query routing to achieve 0.9169 accuracy on LoCoMo and 93 percent on LongMemEvalS while using 80 percent fewer tokens...
- Memory as Metabolism: A Design for Companion Knowledge Systems. This paper designs a companion knowledge system with TRIAGE, DECAY, CONTEXTUALIZE, CONSOLIDATE, and AUDIT operations plus memory gravity and minority-hypothesis retention to give contradictory evidence a path to updat...
- A Framework for Longitudinal Health AI Agents. Proposes a multi-layer framework and agent architecture that operationalizes adaptation, coherence, continuity, and agency for longitudinal health AI agents.
Reference graph
Works this paper leans on
-
[1]
Scaling Learning Algorithms Towards
Bengio, Yoshua and LeCun, Yann , booktitle =. Scaling Learning Algorithms Towards
-
[2]
and Osindero, Simon and Teh, Yee Whye , journal =
Hinton, Geoffrey E. and Osindero, Simon and Teh, Yee Whye , journal =. A Fast Learning Algorithm for Deep Belief Nets , volume =
- [3]
-
[5]
DialSim: A Real-Time Simulator for Evaluating Long-Term Dialogue Understanding of Conversational Agents , author=. 2024 , eprint=
work page 2024
-
[10]
Jesse Mu and Xiang Li and Noah D. Goodman , editor =. Learning to Compress Prompts with Gist Tokens , booktitle =. 2023 , url =
work page 2023
-
[13]
Yucheng Li and Bo Dong and Frank Guerin and Chenghua Lin , editor =. Compressing Context to Enhance Inference Efficiency of Large Language Models , booktitle =. 2023 , url =. doi:10.18653/V1/2023.EMNLP-MAIN.391 , timestamp =
-
[14]
The Twelfth International Conference on Learning Representations,
Fangyuan Xu and Weijia Shi and Eunsol Choi , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =
work page 2024
-
[15]
Forty-first International Conference on Machine Learning,
Yao Fu and Rameswar Panda and Xinyao Niu and Xiang Yue and Hannaneh Hajishirzi and Yoon Kim and Hao Peng , title =. Forty-first International Conference on Machine Learning,. 2024 , url =
work page 2024
-
[17]
Freda Shi and Xinyun Chen and Kanishka Misra and Nathan Scales and David Dohan and Ed H. Chi and Nathanael Sch. Large Language Models Can Be Easily Distracted by Irrelevant Context , booktitle =. 2023 , url =
work page 2023
-
[19]
Augmenting Language Models with Long-Term Memory , booktitle =
Weizhi Wang and Li Dong and Hao Cheng and Xiaodong Liu and Xifeng Yan and Jianfeng Gao and Furu Wei , editor =. Augmenting Language Models with Long-Term Memory , booktitle =. 2023 , url =
work page 2023
-
[22]
Parth Sarthi and Salman Abdullah and Aditi Tuli and Shubh Khanna and Anna Goldie and Christopher D. Manning , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =
work page 2024
-
[23]
Memory and New Controls for ChatGPT , author =. 2024 , howpublished =
work page 2024
- [24]
- [25]
- [26]
-
[27]
Cognitive Kernel: An Open-source Agent System towards Generalist Autopilots , author=. 2024 , eprint=
work page 2024
- [28]
-
[30]
Judging LLM-as-a-judge with MT-Bench and Chatbot Arena , author=. 2023 , eprint=
work page 2023
- [34]
-
[38]
Gautier Izacard and Mathilde Caron and Lucas Hosseini and Sebastian Riedel and Piotr Bojanowski and Armand Joulin and Edouard Grave , title =. Trans. Mach. Learn. Res. , volume =. 2022 , url =
work page 2022
-
[40]
Qwen2.5: A Party of Foundation Models , url =
Qwen Team , month =. Qwen2.5: A Party of Foundation Models , url =
- [41]
-
[45]
Du, Yiming and Wang, Hongru and Zhao, Zhengyi and Liang, Bin and Wang, Baojun and Zhong, Wanjun and Wang, Zezhong and Wong, Kam-Fai. P er LTQA : A Personal Long-Term Memory Dataset for Memory Classification, Retrieval, and Fusion in Question Answering. Proceedings of the 10th SIGHAN Workshop on Chinese Language Processing (SIGHAN-10). 2024
work page 2024
-
[54]
Language Model Information Retrieval with Document Expansion
Tao, Tao and Wang, Xuanhui and Mei, Qiaozhu and Zhai, ChengXiang. Language Model Information Retrieval with Document Expansion. Proceedings of the Human Language Technology Conference of the NAACL , Main Conference. 2006
work page 2006
-
[58]
Marah I Abdin, Sam Ade Jacobs, Ammar Ahmad Awan, Jyoti Aneja, Ahmed Awadallah, Hany Awadalla, Nguyen Bach, Amit Bahree, Arash Bakhtiari, Harkirat S. Behl, Alon Benhaim, Misha Bilenko, Johan Bjorck, S \' e bastien Bubeck, Martin Cai, Caio C \' e sar Teodoro Mendes, Weizhu Chen, Vishrav Chaudhary, Parul Chopra, Allie Del Giorno, Gustavo de Rosa, Matthew Dix...
work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2404.14219 2024
-
[59]
Make your LLM fully utilize the context
Shengnan An, Zexiong Ma, Zeqi Lin, Nanning Zheng, and Jian - Guang Lou. Make your LLM fully utilize the context. CoRR, abs/2404.16811, 2024. doi:10.48550/ARXIV.2404.16811. URL https://doi.org/10.48550/arXiv.2404.16811
-
[60]
Longformer: The Long-Document Transformer
Iz Beltagy, Matthew E. Peters, and Arman Cohan. Longformer: The long-document transformer. arXiv:2004.05150, 2020
work page internal anchor Pith review Pith/arXiv arXiv 2004
-
[61]
M ulti WOZ - A Large-Scale Multi-Domain W izard-of- O z Dataset for Task-Oriented Dialogue Modelling
Pawe Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, I \ n igo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Ga s i \'c . M ulti WOZ - a large-scale multi-domain W izard-of- O z dataset for task-oriented dialogue modelling. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun ' ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical ...
-
[62]
arXiv preprint arXiv:2310.05029 , year=
Howard Chen, Ramakanth Pasunuru, Jason Weston, and Asli Celikyilmaz. Walking down the memory maze: Beyond context limit through interactive reading. CoRR, abs/2310.05029, 2023 a . doi:10.48550/ARXIV.2310.05029. URL https://doi.org/10.48550/arXiv.2310.05029
-
[63]
Tong Chen, Hongwei Wang, Sihao Chen, Wenhao Yu, Kaixin Ma, Xinran Zhao, Hongming Zhang, and Dong Yu. Dense X retrieval: What retrieval granularity should we use? CoRR, abs/2312.06648, 2023 b . doi:10.48550/ARXIV.2312.06648. URL https://doi.org/10.48550/arXiv.2312.06648
-
[64]
Adaptinglanguagemodelstocompresscontexts
Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pp.\ 3829--3846. Association for Computational Linguistic...
-
[65]
Coze . Memory overview guide. https://www.coze.com/docs/guides/memory_overview?_lang=en, 2024. Accessed: September 15, 2024
work page 2024
-
[66]
Enhancing chat language models by scaling high-quality instructional conversations
Ning Ding, Yulin Chen, Bokai Xu, Yujia Qin, Zhi Zheng, Shengding Hu, Zhiyuan Liu, Maosong Sun, and Bowen Zhou. Enhancing chat language models by scaling high-quality instructional conversations. arXiv preprint arXiv:2305.14233, 2023
-
[67]
Yiming Du, Hongru Wang, Zhengyi Zhao, Bin Liang, Baojun Wang, Wanjun Zhong, Zezhong Wang, and Kam-Fai Wong. P er LTQA : A personal long-term memory dataset for memory classification, retrieval, and fusion in question answering. In Kam-Fai Wong, Min Zhang, Ruifeng Xu, Jing Li, Zhongyu Wei, Lin Gui, Bin Liang, and Runcong Zhao (eds.), Proceedings of the 10t...
work page 2024
-
[68]
Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Amy Yang, Angela Fan, et al. The llama 3 herd of models. arXiv preprint arXiv:2407.21783, 2024
work page internal anchor Pith review Pith/arXiv arXiv 2024
-
[69]
Improving retrieval of short texts through document expansion
Miles Efron, Peter Organisciak, and Katrina Fenlon. Improving retrieval of short texts through document expansion. In William R. Hersh, Jamie Callan, Yoelle Maarek, and Mark Sanderson (eds.), The 35th International ACM SIGIR conference on research and development in Information Retrieval, SIGIR '12, Portland, OR, USA, August 12-16, 2012 , pp.\ 911--920. A...
-
[70]
Data engineering for scaling language models to 128k context
Yao Fu, Rameswar Panda, Xinyao Niu, Xiang Yue, Hannaneh Hajishirzi, Yoon Kim, and Hao Peng. Data engineering for scaling language models to 128k context. In Forty-first International Conference on Machine Learning, ICML 2024, Vienna, Austria, July 21-27, 2024 . OpenReview.net, 2024. URL https://openreview.net/forum?id=TaAqeo7lUh
work page 2024
-
[71]
arXiv:2405.14831 [cs.CL] https://arxiv.org/abs/2405.14831
Bernal Jim \' e nez Guti \' e rrez, Yiheng Shu, Yu Gu, Michihiro Yasunaga, and Yu Su. Hipporag: Neurobiologically inspired long-term memory for large language models. CoRR, abs/2405.14831, 2024. doi:10.48550/ARXIV.2405.14831. URL https://doi.org/10.48550/arXiv.2405.14831
-
[72]
Unsupervised dense information retrieval with contrastive learning
Gautier Izacard, Mathilde Caron, Lucas Hosseini, Sebastian Riedel, Piotr Bojanowski, Armand Joulin, and Edouard Grave. Unsupervised dense information retrieval with contrastive learning. Trans. Mach. Learn. Res., 2022, 2022. URL https://openreview.net/forum?id=jKN1pXi7b0
work page 2022
-
[73]
Huiqiang Jiang, Qianhui Wu, Chin - Yew Lin, Yuqing Yang, and Lili Qiu. Llmlingua: Compressing prompts for accelerated inference of large language models. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, EMNLP 2023, Singapore, December 6-10, 2023 , pp.\ 13358--13376...
-
[74]
Needle in a haystack - pressure testing llms
Gregory Kamradt. Needle in a haystack - pressure testing llms. GitHub, 2023. URL https://github.com/gkamradt/LLMTest_NeedleInAHaystack
work page 2023
-
[75]
Jiho Kim, Woosog Chay, Hyeonji Hwang, Daeun Kyung, Hyunseung Chung, Eunbyeol Cho, Yohan Jo, and Edward Choi. Dialsim: A real-time simulator for evaluating long-term dialogue understanding of conversational agents, 2024. URL https://arxiv.org/abs/2406.13144
-
[76]
Reformer: The Efficient Transformer
Nikita Kitaev, ukasz Kaiser, and Anselm Levskaya. Reformer: The efficient transformer. arXiv preprint arXiv:2001.04451, 2020
work page internal anchor Pith review arXiv 2001
-
[77]
Hello again! llm-powered personalized agent for long-term dialogue
Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat - Seng Chua. Hello again! llm-powered personalized agent for long-term dialogue. CoRR, abs/2406.05925, 2024. doi:10.48550/ARXIV.2406.05925. URL https://doi.org/10.48550/arXiv.2406.05925
-
[78]
Available: https://doi.org/10.1162/tacl a 00449
Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Trans. Assoc. Comput. Linguistics, 12: 0 157--173, 2024. doi:10.1162/TACL\_A\_00638. URL https://doi.org/10.1162/tacl\_a\_00638
work page internal anchor Pith review doi:10.1162/tacl 2024
-
[79]
G -Eval: NLG Evaluation using Gpt-4 with Better Human Alignment
Yang Liu, Dan Iter, Yichong Xu, Shuohang Wang, Ruochen Xu, and Chenguang Zhu. G -eval: NLG evaluation using gpt-4 with better human alignment. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp.\ 2511--2522, Singapore, December 2023. Association for Computational ...
-
[80]
Evaluating very long-term conversational memory of LLM agents
Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. In Lun-Wei Ku, Andre Martins, and Vivek Srikumar (eds.), Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp.\ 13851--13870, Bangkok...
-
[81]
Microsoft. Announcing Microsoft Copilot, your everyday AI companion, 2023. URL https://blogs.microsoft.com/blog/2023/09/21/announcing-microsoft-copilot-your-everyday-ai-companion/. Accessed: September 15, 2024
-
[82]
Mistral AI Team. Mistral NeMo: Our new best small model. Mistral AI, July 2024. URL https://mistral.ai/news/mistral-nemo
-
[83]
Jesse Mu, Xiang Li, and Noah D. Goodman. Learning to compress prompts with gist tokens. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, NeurIPS 2023, New Orleans, LA, USA, December 10 - 1...
-
[84]
Niklas Muennighoff, Nouamane Tazi, Loic Magne, and Nils Reimers. MTEB: Massive text embedding benchmark. In Andreas Vlachos and Isabelle Augenstein (eds.), Proceedings of the 17th Conference of the European Chapter of the Association for Computational Linguistics, pp. 2014--2037, Dubrovnik, Croatia, May 2023. Association for Computational Linguistics. d...
-
[85]
OpenAI. ChatGPT, 2022. URL https://chat.openai.com/chat. Accessed: September 15, 2024
-
[86]
OpenAI. Memory and new controls for ChatGPT. https://openai.com/index/memory-and-new-controls-for-chatgpt/, 2024. Accessed: September 15, 2024
-
[87]
Siva Reddy, Danqi Chen, and Christopher D. Manning. CoQA: A conversational question answering challenge. Transactions of the Association for Computational Linguistics, 7:249--266, 2019. doi:10.1162/tacl_a_00266. URL https://aclanthology.org/Q19-1016/
-
[88]
Stephen E. Robertson and Hugo Zaragoza. The probabilistic relevance framework: BM25 and beyond. Found. Trends Inf. Retr., 3(4):333--389, 2009. doi:10.1561/1500000019. URL https://doi.org/10.1561/1500000019
-
[89]
Parth Sarthi, Salman Abdullah, Aditi Tuli, Shubh Khanna, Anna Goldie, and Christopher D. Manning. RAPTOR: Recursive abstractive processing for tree-organized retrieval. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=GN921JHCRw
-
[90]
Freda Shi, Xinyun Chen, Kanishka Misra, Nathan Scales, David Dohan, Ed H. Chi, Nathanael Schärli, and Denny Zhou. Large language models can be easily distracted by irrelevant context. In Andreas Krause, Emma Brunskill, Kyunghyun Cho, Barbara Engelhardt, Sivan Sabato, and Jonathan Scarlett (eds.), International Conference on Machine Learning, ICML 202...
-
[91]
Weijia Shi, Sewon Min, Michihiro Yasunaga, Minjoon Seo, Richard James, Mike Lewis, Luke Zettlemoyer, and Wen-tau Yih. REPLUG: Retrieval-augmented black-box language models. In Kevin Duh, Helena Gomez, and Steven Bethard (eds.), Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Languag...
-
[92]
Tao Tao, Xuanhui Wang, Qiaozhu Mei, and ChengXiang Zhai. Language model information retrieval with document expansion. In Robert C. Moore, Jeff Bilmes, Jennifer Chu-Carroll, and Mark Sanderson (eds.), Proceedings of the Human Language Technology Conference of the NAACL, Main Conference, pp. 407--414, New York City, USA, June 2006. Association for Compu...
-
[93]
Qwen Team. Qwen2.5: A party of foundation models, September 2024. URL https://qwenlm.github.io/blog/qwen2.5/
-
[94]
Weizhi Wang, Li Dong, Hao Cheng, Xiaodong Liu, Xifeng Yan, Jianfeng Gao, and Furu Wei. Augmenting language models with long-term memory. In Alice Oh, Tristan Naumann, Amir Globerson, Kate Saenko, Moritz Hardt, and Sergey Levine (eds.), Advances in Neural Information Processing Systems 36: Annual Conference on Neural Information Processing Systems 2023, Ne...
-
[95]
Wei Wei, Quoc Le, Andrew Dai, and Jia Li. AirDialogue: An environment for goal-oriented dialogue research. In Ellen Riloff, David Chiang, Julia Hockenmaier, and Jun'ichi Tsujii (eds.), Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing, pp. 3844--3854, Brussels, Belgium, October-November 2018. Association for Comp...
-
[96]
Jason Weston, Sumit Chopra, and Antoine Bordes. Memory networks. arXiv preprint arXiv:1410.3916, 2014
-
[97]
Yuhuai Wu, Markus N Rabe, DeLesley Hutchins, and Christian Szegedy. Memorizing transformers. arXiv preprint arXiv:2203.08913, 2022
-
[98]
Canwen Xu, Daya Guo, Nan Duan, and Julian McAuley. Baize: An open-source chat model with parameter-efficient tuning on self-chat data. In Houda Bouamor, Juan Pino, and Kalika Bali (eds.), Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pp. 6268--6278, Singapore, December 2023. Association for Computational Linguist...
-
[99]
Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, ICLR 2024, Vienna, Austria, May 7-11, 2024. OpenReview.net, 2024. URL https://openreview.net/forum?id=mlJLVigNHp
-
[100]
Jing Xu, Arthur Szlam, and Jason Weston. Beyond goldfish memory: Long-term open-domain conversation. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 5180--5197, Dublin, Ireland, May 2022a. Association for Computationa...
-
[101]
Xinchao Xu, Zhibin Gou, Wenquan Wu, Zheng-Yu Niu, Hua Wu, Haifeng Wang, and Shihang Wang. Long time no see! Open-domain conversation with long-term persona memory. In Smaranda Muresan, Preslav Nakov, and Aline Villavicencio (eds.), Findings of the Association for Computational Linguistics: ACL 2022, pp. 2639--2650, Dublin, Ireland, May 2022b. Associati...
-
[102]
Fan Yin, Jesse Vig, Philippe Laban, Shafiq Joty, Caiming Xiong, and Chien-Sheng Wu. Did you read the instructions? Rethinking the effectiveness of task definitions in instruction learning. In Anna Rogers, Jordan Boyd-Graber, and Naoaki Okazaki (eds.), Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long P...
-
[103]
Wenhao Yu, Hongming Zhang, Xiaoman Pan, Kaixin Ma, Hongwei Wang, and Dong Yu. Chain-of-note: Enhancing robustness in retrieval-augmented language models. arXiv preprint arXiv:2311.09210, 2023
-
[104]
Dun Zhang. Stella EN 1.5B v5. https://huggingface.co/dunzhang/stella_en_1.5B_v5, 2023. Accessed: September 15, 2024
-
[105]
Hongming Zhang, Xiaoman Pan, Hongwei Wang, Kaixin Ma, Wenhao Yu, and Dong Yu. Cognitive kernel: An open-source agent system towards generalist autopilots, 2024. URL https://arxiv.org/abs/2409.10277
-
[106]
Michael Zhang and Eunsol Choi. SituatedQA: Incorporating extra-linguistic contexts into QA. In Marie-Francine Moens, Xuanjing Huang, Lucia Specia, and Scott Wen-tau Yih (eds.), Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 7371--7387, Online and Punta Cana, Dominican Republic, November 2021. Association f...
-
[107]
Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph E. Gonzalez, and Ion Stoica. Judging LLM-as-a-judge with MT-Bench and Chatbot Arena, 2023
-
[108]
Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. Memorybank: Enhancing large language models with long-term memory. In Michael J. Wooldridge, Jennifer G. Dy, and Sriraam Natarajan (eds.), Thirty-Eighth AAAI Conference on Artificial Intelligence, AAAI 2024, Thirty-Sixth Conference on Innovative Applications of Artificial Intelligence, IAAI 20...
-
[109]
Zexuan Zhong, Tao Lei, and Danqi Chen. Training language models with memory augmentation. In Yoav Goldberg, Zornitsa Kozareva, and Yue Zhang (eds.), Proceedings of the 2022 Conference on Empirical Methods in Natural Language Processing, pp. 5657--5673, Abu Dhabi, United Arab Emirates, December 2022. Association for Computational Linguistics. doi:10.18653...