DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA
Pith reviewed 2026-05-22 07:08 UTC · model grok-4.3
The pith
DeferMem distills query-specific evidence from long histories at query time using reinforcement learning to boost QA accuracy and efficiency.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper claims that post-retrieval evidence distillation can be cast as a structured RL action of message selection plus rewriting, optimized through a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, so that high-recall but noisy candidates are turned into faithful, query-specific evidence without requiring pre-query memory processing.
What carries the argument
DistillPO, the reinforcement learning algorithm that formulates evidence distillation as a structured action of selecting and rewriting retrieved messages, then optimizes it with decomposed rewards that gate from validity checks to quality checks and assign advantages to responsible output spans.
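To make the phrase "assign advantages to responsible output spans" concrete, here is a minimal sketch under stated assumptions. It assumes a group of rollouts per query, a per-component reward for each rollout, and known token ranges for the selection and rewriting parts of the output. The group normalization, the function names, and the dictionary layout are all hypothetical and are not taken from the paper.

```python
from statistics import mean, pstdev


def group_normalize(values, eps=1e-6):
    """Center and scale one reward component across a group of rollouts."""
    mu = mean(values)
    sigma = pstdev(values)
    return [(v - mu) / (sigma + eps) for v in values]


def span_aligned_advantages(rollouts):
    """Build per-token advantages, one list per rollout.

    Each rollout is a dict (hypothetical layout):
      'num_tokens': int
      'spans':   {component: (start, end)}  half-open token ranges
      'rewards': {component: float}
    A component's normalized reward is added only to tokens in its own span.
    """
    components = sorted({c for r in rollouts for c in r["rewards"]})
    per_token = [[0.0] * r["num_tokens"] for r in rollouts]
    for comp in components:
        normed = group_normalize([r["rewards"].get(comp, 0.0) for r in rollouts])
        for i, r in enumerate(rollouts):
            if comp not in r["spans"]:
                continue  # no tokens are responsible for this component
            start, end = r["spans"][comp]
            for t in range(start, end):
                per_token[i][t] += normed[i]
    return per_token
```

With this layout, a rollout that selects well but rewrites poorly receives positive credit on its selection tokens and negative credit on its rewriting tokens, rather than a single scalar spread over the whole output.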
If this is right
- On LoCoMo and LongMemEval-S it reaches the highest QA accuracy among the tested systems.
- It delivers the fastest runtime for memory operations compared with strong baselines.
- It incurs zero commercial-API token cost for all memory-related steps.
- It improves both answer accuracy and overall memory-system efficiency over pre-processed memory approaches.
Where Pith is reading between the lines
- The same query-time selection-plus-rewriting pattern could be tested on other retrieval-augmented tasks where the query arrives after the full context is stored.
- If the decomposed reward design generalizes, it might reduce the need for expensive pre-computation of memory summaries in long-running agent deployments.
- The approach suggests a way to keep raw history intact while still producing compact evidence, which could help systems handle histories that grow without bound.
Load-bearing premise
That casting post-retrieval distillation as a structured RL action of selection and rewriting, with decomposed rewards, will reliably produce faithful query-conditioned outputs without introducing new hallucinations or omissions.
What would settle it
A set of long conversational test cases in which the distilled evidence either drops a key supporting fact present in the raw history or adds a fabricated detail, causing the final QA answer to be incorrect even when retrieval recall was high.
Original abstract
Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
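The abstract does not spell out the segment-link structure, so the following is only an illustrative sketch of what high-recall, query-time candidate retrieval over segments could look like. The lexical-overlap scoring, the `links` table, and the `retrieve_candidates` function are assumptions made for illustration and should not be read as the paper's method.

```python
def tokenize(text):
    """Lowercased whitespace tokens; a stand-in for any real scorer."""
    return set(text.lower().split())


def retrieve_candidates(query, segments, links, top_k=2):
    """Return messages from the best-matching segments plus linked segments.

    segments: {seg_id: [{"msg_id": str, "text": str}, ...]}
    links:    {seg_id: [linked seg_id, ...]}  (hypothetical link table)
    """
    q = tokenize(query)
    scored = []
    for seg_id, msgs in segments.items():
        # Score a segment by its best-matching message.
        best = max((len(q & tokenize(m["text"])) for m in msgs), default=0)
        scored.append((best, seg_id))
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    seeds = [seg_id for score, seg_id in scored[:top_k] if score > 0]

    keep = []
    for seg_id in seeds:
        # Expand each seed to its linked segments to favor recall over precision.
        for s in [seg_id, *links.get(seg_id, [])]:
            if s in segments and s not in keep:
                keep.append(s)
    return [m for s in keep for m in segments[s]]
```

The expansion step deliberately returns more than the query strictly matches, which is the kind of noisy candidate set the distiller is then expected to clean up.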
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript presents DeferMem, a long-term memory QA framework that decouples high-recall candidate retrieval (via a lightweight segment-link structure over conversational history) from query-conditioned evidence distillation. The distillation step is performed by a policy trained with DistillPO, which formulates the task as a structured RL action (message selection plus rewriting) optimized via a decomposed-and-gated reward pipeline and structure-aligned advantage assignment. The paper reports that DeferMem achieves the highest QA accuracy on LoCoMo and LongMemEval-S while also delivering the fastest runtime and zero commercial-API token cost for memory operations.
Significance. If the DistillPO training reliably yields faithful, query-specific evidence without introducing hallucinations or omissions, the query-time distillation approach could meaningfully improve efficiency and accuracy for LLM agents operating over long histories, by avoiding both pre-query compression and post-retrieval denoising.
major comments (2)
- [§3.2] §3.2 (DistillPO formulation): The structure-aligned advantage assignment is presented as localizing each reward component to its responsible output span, yet the manuscript provides no explicit mechanism or analysis showing how message-level validity gates prevent unfaithful rewrites that add plausible but non-entailed content. This assumption is load-bearing for the central claim that the distilled evidence remains faithful while improving downstream QA accuracy.
- [Experimental results] Experimental results (LoCoMo and LongMemEval-S tables): The headline performance claims rest on benchmark wins, but the text supplies no quantitative ablations on individual reward components, no error analysis of hallucination or omission rates in the distilled outputs, and no comparison against non-RL distillation baselines. Without these, it is difficult to attribute gains specifically to the RL design rather than the retrieval stage.
minor comments (2)
- [Abstract] The abstract states 'zero commercial-API token cost for memory operations' but does not clarify whether the distiller itself incurs any external API usage during training or inference.
- [§3.2] Notation for the decomposed reward terms (validity, quality, task correctness) is introduced without a consolidated table or equation summarizing their gating logic and weighting.
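As an illustration of the kind of consolidated summary the last minor comment asks for, the sketch below writes one possible gating order as code. The three component names come from the referee's comment; the specific gating order, the argument names, and the absence of weights are assumptions, since the manuscript's actual logic is not given in the text available here.

```python
def gated_reward(valid_format, valid_selection, task_correct, quality):
    """Return per-component rewards under one assumed gating order.

    All inputs are hypothetical signals:
      valid_format:    bool, the output parses into the expected structure
      valid_selection: bool, every selected msg_id exists among the candidates
      task_correct:    float in [0, 1], downstream answer correctness
      quality:         float in [0, 1], a rewrite-quality score
    """
    rewards = {"validity": 0.0, "task": 0.0, "quality": 0.0}
    if not valid_format:
        # First gate: unparseable output earns nothing at all.
        return rewards
    rewards["validity"] = 1.0 if valid_selection else 0.0
    # Task correctness is exposed as soon as the output is parseable.
    rewards["task"] = task_correct
    if valid_selection:
        # Quality is only scored once the validity checks pass.
        rewards["quality"] = quality
    return rewards
```

A table or equation in the manuscript would additionally need to state the weights used to combine these components, which this sketch leaves out on purpose.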
Simulated Author's Rebuttal
We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will improve the clarity and evidential support for our claims.
Point-by-point responses
- Referee: [§3.2] §3.2 (DistillPO formulation): The structure-aligned advantage assignment is presented as localizing each reward component to its responsible output span, yet the manuscript provides no explicit mechanism or analysis showing how message-level validity gates prevent unfaithful rewrites that add plausible but non-entailed content. This assumption is load-bearing for the central claim that the distilled evidence remains faithful while improving downstream QA accuracy.
  Authors: We appreciate the referee highlighting the need for greater explicitness here. The decomposed-and-gated reward pipeline in §3.2 applies a message-level validity gate before any rewriting step, with invalid selections receiving zero reward and being excluded from further processing. This is intended to block propagation of unfaithful content. However, we agree the current text does not provide sufficient formalization or supporting analysis of this gating behavior. In the revision we will expand §3.2 with pseudocode for the validity gate, a step-by-step description of how it interacts with the structure-aligned advantage assignment, and qualitative examples illustrating prevention of non-entailed additions.
  Revision: yes
- Referee: [Experimental results] Experimental results (LoCoMo and LongMemEval-S tables): The headline performance claims rest on benchmark wins, but the text supplies no quantitative ablations on individual reward components, no error analysis of hallucination or omission rates in the distilled outputs, and no comparison against non-RL distillation baselines. Without these, it is difficult to attribute gains specifically to the RL design rather than the retrieval stage.
  Authors: We agree that these analyses are important for isolating the contribution of DistillPO. The present experiments emphasize end-to-end QA accuracy and efficiency, but do not include the requested breakdowns. In the revised manuscript we will add quantitative ablations that remove individual reward components, a dedicated error analysis reporting hallucination and omission rates via manual inspection of a representative sample of distilled outputs, and direct comparisons against non-RL distillation baselines (such as prompting-based selection and rewriting without reinforcement learning). These results will be placed in the experimental section or an appendix.
  Revision: yes
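The first response describes a message-level validity gate in which invalid selections receive zero reward and are excluded from further processing. A minimal sketch of that behavior, assuming a hypothetical `score_rewrite` callable and a simple definition of "valid" (known id, not a duplicate, non-empty text), might look like this; it is not the pseudocode the authors promise.

```python
def apply_validity_gate(distilled, candidate_ids, score_rewrite):
    """Return one reward per distilled entry, in input order.

    distilled:     list of {"msg_id": ..., "info": ...} entries
    candidate_ids: set of msg_ids retrieved for this query
    score_rewrite: callable(entry) -> float (hypothetical quality scorer)
    Invalid entries get 0.0 and are never passed to the scorer.
    """
    rewards = []
    seen = set()
    for entry in distilled:
        msg_id = entry.get("msg_id")
        info = str(entry.get("info") or "")
        valid = (
            msg_id in candidate_ids   # must point at a retrieved message
            and msg_id not in seen    # at most one entry per message
            and bool(info.strip())    # rewrite must not be empty
        )
        if valid:
            seen.add(msg_id)
            rewards.append(score_rewrite(entry))
        else:
            rewards.append(0.0)
    return rewards
```

A gate of this kind only checks that each rewrite is attached to a real candidate message. It does not check whether the rewritten text is entailed by that message, which is the gap the referee's first major comment points at.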
Circularity Check
No significant circularity; results measured on external benchmarks
The paper introduces DeferMem and its DistillPO RL procedure as a proposed workflow (high-recall retrieval followed by query-time structured distillation via selection+rewriting), then reports empirical QA accuracy, runtime, and zero commercial token cost on the independent external benchmarks LoCoMo and LongMemEval-S. These headline metrics are obtained after training and are not algebraically or statistically forced by the decomposed rewards, validity gates, or structure-aligned advantage assignment; they constitute an independent evaluation. No equations, self-citations, or uniqueness theorems appear in the supplied text that would reduce the claimed performance to a re-labeling of the training inputs. The derivation chain is therefore self-contained against external benchmarks.
Lean theorems connected to this paper
- IndisputableMonolith/Foundation/RealityFromDistinction.lean: reality_from_one_distinction
  Tag: unclear (relation between the paper passage and the cited Recognition theorem)
  Passage: DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting... decomposed-and-gated reward pipeline and structure-aligned advantage assignment
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
A Datasets and Baseline Methods
A.1 Datasets. We evaluate lo...
Distiller prompt instructions
- Read the question and the conversation history. Consider each message one by one.
- Add a message's "msg_id" to "useful_msg" ONLY if that message is actually useful for answering the question.
- For every "msg_id" in "useful_msg", add exactly one entry to "distilled_info":
  - "msg_id": the same id
  - "info": a single self-contained statement (or a compact set of statements). Includes all information from the target message (i.e., the message of the same msg_id) that is useful for answering the question.
- Make "info" self-contained:
  - Conduct reference resolution (pronouns, ellipsis, named entities) when the referent is unambiguous in its surrounding context.
  - Interpret "I/we/my" from the perspective of the message's speaker and interpret "you/your" as the conversational counterpart, unless the context indicates reported speech.
- Each "info" entry must be grounded primarily in the message of the same msg_id, plus minimal preceding discourse context when necessary.
  - You may use nearby preceding messages in the same segment for two limited purposes: (a) Reference resolution: resolve pronouns/ellipsis when unambiguous. (b) Discourse-context restoration: recover the minimal preceding...
- Do NOT include meta commentary (e.g., "this message is useful because...") in "info".
- If the conversation history contains no information useful for answering the question, output: {{ "useful_msg": [], "distilled_info": [] }}
Distiller user prompt
[CONVERSATION HISTORY FORMAT]
- The conversation history is a list of conversation segments.
- Each segment is a list of messages.
- Each message has the following fields:
  - `msg_id`: the id of t...
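The output contract in the distiller instructions above (a "useful_msg" list, a "distilled_info" list, and exactly one entry per selected id) can be checked mechanically. The sketch below is one way to do that; the function name, the error messages, and the assumption that message ids are strings or integers are illustrative and not part of the paper.

```python
import json


def validate_distiller_output(raw, candidate_ids):
    """Check a raw distiller response against the stated output contract.

    raw:           JSON text produced by the distiller
    candidate_ids: set of msg_ids present in the conversation history
    Returns (ok, errors), where errors is a list of human-readable strings.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, [f"not valid JSON: {exc.msg}"]
    if not isinstance(data, dict):
        return False, ["top level must be an object"]

    useful = data.get("useful_msg")
    distilled = data.get("distilled_info")
    if not isinstance(useful, list) or not isinstance(distilled, list):
        return False, ['"useful_msg" and "distilled_info" must both be lists']

    errors = []
    for msg_id in useful:
        # Assumption: ids are strings or integers.
        if not isinstance(msg_id, (str, int)) or msg_id not in candidate_ids:
            errors.append(f"unknown msg_id in useful_msg: {msg_id!r}")

    info_ids = [e.get("msg_id") if isinstance(e, dict) else None for e in distilled]
    if sorted(map(str, info_ids)) != sorted(map(str, useful)):
        errors.append("distilled_info must contain exactly one entry per useful_msg id")

    for e in distilled:
        if not isinstance(e, dict) or not str(e.get("info") or "").strip():
            errors.append("every distilled_info entry needs a non-empty info string")

    return (not errors), errors
```

A check like this covers only the structural part of the instructions. Whether each "info" string is self-contained and grounded in its target message still needs a separate judgment.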
- Info_extracted: a list of {msg_id, info}
- Original_Segs: a list of conversation segments containing the original messages.
Data format of Original_Segs:
- The conversation history is a list of conversation segments.
- Each segment is a list of messages.
- Each message has the following fields:
  - `msg_id`: the id of the message.
  - `speaker` (optional): who said this message (e.g., "user", "assista...
- resolve references/pronouns in the TARGET MESSAGE or in the info
- recover the minimal conversational context needed to interpret what the TARGET MESSAGE is responding to.
- Speaker viewpoint rule:
  - Pronouns like "I/we/my" are from the perspective of the TARGET MESSAGE's speaker.
  - "you/your" refers to the conversational counterpart unless context indicates otherwise.
  - For reported speech/quotations, resolve pronouns b...