DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Jianing Yin; Tan Tang

arxiv: 2605.22411 · v1 · pith:6RDIGSSGnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI · cs.LG


Pith reviewed 2026-05-22 07:08 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG

keywords: long-term memory QA, evidence distillation, reinforcement learning, LLM agents, memory systems, query-time processing, conversation history


The pith

DeferMem distills query-specific evidence from long histories at query time using reinforcement learning to boost QA accuracy and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models struggle when answers depend on evidence scattered across long conversation histories filled with irrelevant content. DeferMem decouples the task into broad candidate retrieval at query time followed by a learned distillation step that selects and rewrites messages into clean, self-contained evidence. It trains this distiller with DistillPO, a reinforcement learning method that breaks the distillation action into selection and rewriting steps and optimizes them with gated rewards that check validity before quality. This query-conditioned approach is meant to reduce the noise that downstream answerers would otherwise have to filter out themselves.
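The two-stage flow described above (broad retrieval, then query-conditioned distillation) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the names `Message`, `retrieve_candidates`, `keyword_distiller`, and `build_evidence` are hypothetical, word overlap stands in for the segment-link retriever, and the toy distiller only selects messages, whereas the paper's learned distiller also rewrites them.

```python
import re
from dataclasses import dataclass
from typing import Callable, List, Set

# Hypothetical stopword list, only for this toy example.
STOPWORDS = {"what", "is", "the", "a", "i"}


@dataclass
class Message:
    msg_id: int
    text: str


def content_words(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens with stopwords removed."""
    return set(re.findall(r"[a-z0-9]+", text.lower())) - STOPWORDS


def retrieve_candidates(history: List[Message], query: str, top_k: int) -> List[Message]:
    """High-recall stand-in: rank by word overlap and keep top_k, noise included."""
    q = content_words(query)
    ranked = sorted(history, key=lambda m: len(q & content_words(m.text)), reverse=True)
    return ranked[:top_k]


def keyword_distiller(query: str, candidates: List[Message]) -> List[str]:
    """Toy placeholder for a learned distiller: keep only overlapping messages."""
    q = content_words(query)
    return [f"[msg {m.msg_id}] {m.text}" for m in candidates if q & content_words(m.text)]


def build_evidence(
    history: List[Message],
    query: str,
    distiller: Callable[[str, List[Message]], List[str]],
    top_k: int = 2,
) -> List[str]:
    """Stage 1: broad retrieval at query time. Stage 2: query-conditioned distillation."""
    return distiller(query, retrieve_candidates(history, query, top_k))


history = [
    Message(0, "I adopted a dog named Rex"),
    Message(1, "The weather is nice"),
    Message(2, "Rex loves the park"),
]
evidence = build_evidence(history, "What is the dog named?", keyword_distiller)
print(evidence)
```

In this sketch the retriever deliberately returns an irrelevant message alongside the relevant one, and the distiller is the component that removes it before any answerer sees the context.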

Core claim

The paper claims that post-retrieval evidence distillation can be cast as a structured RL action of message selection plus rewriting, optimized through a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, so that high-recall but noisy candidates are turned into faithful, query-specific evidence without requiring pre-query memory processing.

What carries the argument

DistillPO, the reinforcement learning algorithm that formulates evidence distillation as a structured action of selecting and rewriting retrieved messages, then optimizes it with decomposed rewards that gate from validity checks to quality checks and assign advantages to responsible output spans.
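One way to picture the gating and span-level credit described above is the following sketch. It is not the paper's formulation: the function names, the two span labels, the specific gate order, and the per-span baselines are all assumptions chosen to show the general idea that validity checks come before quality scores and that each reward is credited only to the tokens of its own span.

```python
from typing import Dict, List


def gated_rewards(
    format_valid: bool,
    selection_valid: bool,
    rewrite_quality: float,
    task_correct: float,
) -> Dict[str, float]:
    """Return one reward per output span, applying validity gates first.

    Assumed gate order (hypothetical): output format, then selection validity.
    Quality and task signals only count once both gates pass.
    """
    if not format_valid or not selection_valid:
        return {"select": 0.0, "rewrite": 0.0}
    return {"select": task_correct, "rewrite": rewrite_quality}


def span_advantages(
    token_spans: List[str],
    rewards: Dict[str, float],
    baselines: Dict[str, float],
) -> List[float]:
    """Give each token the advantage of the span it belongs to.

    token_spans labels every output token as "select" or "rewrite".
    baselines holds an assumed per-span baseline, for example a group mean.
    """
    return [rewards[span] - baselines[span] for span in token_spans]


rewards = gated_rewards(True, True, rewrite_quality=0.5, task_correct=1.0)
advantages = span_advantages(
    ["select", "select", "rewrite"],
    rewards,
    baselines={"select": 0.5, "rewrite": 0.25},
)
print(rewards, advantages)
```

The design point the sketch tries to capture is that a failed validity check zeroes every downstream component, so a fluent rewrite of an invalid selection earns nothing.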

If this is right

On LoCoMo and LongMemEval-S it reaches the highest QA accuracy among the tested systems.
It delivers the fastest runtime for memory operations compared with strong baselines.
It incurs zero commercial-API token cost for all memory-related steps.
It improves both answer accuracy and overall memory-system efficiency over pre-processed memory approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same query-time selection-plus-rewriting pattern could be tested on other retrieval-augmented tasks where the query arrives after the full context is stored.
If the decomposed reward design generalizes, it might reduce the need for expensive pre-computation of memory summaries in long-running agent deployments.
The approach suggests a way to keep raw history intact while still producing compact evidence, which could help systems handle histories that grow without bound.

Load-bearing premise

That casting post-retrieval distillation as a structured RL action of selection and rewriting, with decomposed rewards, will reliably produce faithful query-conditioned outputs without introducing new hallucinations or omissions.

What would settle it

A set of long conversational test cases in which the distilled evidence either drops a key supporting fact present in the raw history or adds a fabricated detail, causing the final QA answer to be incorrect even when retrieval recall was high.
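A crude version of such a test can be sketched as a lexical audit of the distilled evidence. This is an illustrative assumption, not a method from the paper: `audit_evidence` is a hypothetical helper, and token matching is only a rough proxy. Legitimate paraphrases would be flagged as unsupported, so a real study would more plausibly rely on an entailment model or manual inspection.

```python
import re
from typing import Dict, List, Set


def _tokens(text: str) -> Set[str]:
    """Lowercase alphanumeric tokens of a string."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def audit_evidence(
    evidence: List[str],
    source_messages: List[str],
    key_facts: List[str],
) -> Dict[str, List[str]]:
    """Flag omissions and unsupported tokens in distilled evidence.

    omitted: key facts whose tokens do not all appear in the evidence.
    unsupported_tokens: evidence tokens that appear in no source message.
    """
    source_vocab: Set[str] = set()
    for message in source_messages:
        source_vocab |= _tokens(message)

    evidence_vocab: Set[str] = set()
    for entry in evidence:
        evidence_vocab |= _tokens(entry)

    omitted = [fact for fact in key_facts if not _tokens(fact) <= evidence_vocab]
    unsupported = sorted(evidence_vocab - source_vocab)
    return {"omitted": omitted, "unsupported_tokens": unsupported}


report = audit_evidence(
    evidence=["Alice moved to Berlin in 2021"],
    source_messages=["Alice moved to Paris in 2021", "Alice works as a chef"],
    key_facts=["Paris", "chef"],
)
print(report)
```

Here the evidence both drops supporting facts and introduces a detail absent from the history, which is the failure pattern the settling test is meant to surface.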

Figures

Figures reproduced from arXiv: 2605.22411 by Jianing Yin, Tan Tang.

**Figure 2.** DeferMem framework: (1) a segment-link retriever produces high-recall candidates, and ...

**Figure 3.** Distribution of the number of distilled evidence entries returned by DeferMem.

Original abstract

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

DeferMem adds query-time RL distillation with DistillPO on a segment-link retrieval base, which targets a real efficiency gap but leaves the faithfulness of the rewritten evidence as the main open question.


The key point here is a query-time evidence distillation step trained via their DistillPO reinforcement learning algorithm, built on a segment-link retrieval backbone for handling long conversational histories in LLM QA. What stands out as new is the split between high-recall candidate retrieval and then distilling those into faithful query-specific evidence using structured actions and decomposed rewards with structure-aligned advantages.

The paper does a solid job calling out the limitations of pre-processing memory without knowing the query and measures real-world efficiency metrics alongside accuracy. On the benchmarks it reports the best results in both categories, which is a nice practical outcome.

The soft spots come from the abstract-only view so far. There are no specific numbers on the reward components or ablations, which makes it harder to gauge how much the RL contributes or if the faithfulness holds. The concern that advantage assignment might not catch unfaithful rewrites that pass coarse gates is a fair one to investigate in the full text; if the checks are localized properly it should be okay, but it needs verification.

This work is aimed at researchers focused on long-term memory for agents and long-context applications. It has enough novelty and grounding in external benchmarks to warrant a serious referee. I would recommend sending it to peer review with attention to the RL design and any hallucination controls.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DeferMem, a long-term memory QA framework that decouples high-recall candidate retrieval (via a lightweight segment-link structure over conversational history) from query-conditioned evidence distillation. The distillation step is performed by a policy trained with DistillPO, which formulates the task as a structured RL action (message selection plus rewriting) optimized via a decomposed-and-gated reward pipeline and structure-aligned advantage assignment. The paper reports that DeferMem achieves the highest QA accuracy on LoCoMo and LongMemEval-S while also delivering the fastest runtime and zero commercial-API token cost for memory operations.

Significance. If the DistillPO training reliably yields faithful, query-specific evidence without introducing hallucinations or omissions, the query-time distillation approach could meaningfully improve efficiency and accuracy for LLM agents operating over long histories, by avoiding both pre-query compression and post-retrieval denoising.

major comments (2)

[§3.2] DistillPO formulation: The structure-aligned advantage assignment is presented as localizing each reward component to its responsible output span, yet the manuscript provides no explicit mechanism or analysis showing how message-level validity gates prevent unfaithful rewrites that add plausible but non-entailed content. This assumption is load-bearing for the central claim that the distilled evidence remains faithful while improving downstream QA accuracy.
[Experimental results] LoCoMo and LongMemEval-S tables: The headline performance claims rest on benchmark wins, but the text supplies no quantitative ablations on individual reward components, no error analysis of hallucination or omission rates in the distilled outputs, and no comparison against non-RL distillation baselines. Without these, it is difficult to attribute gains specifically to the RL design rather than the retrieval stage.

minor comments (2)

[Abstract] The abstract states 'zero commercial-API token cost for memory operations' but does not clarify whether the distiller itself incurs any external API usage during training or inference.
[§3.2] Notation for the decomposed reward terms (validity, quality, task correctness) is introduced without a consolidated table or equation summarizing their gating logic and weighting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will improve the clarity and evidential support for our claims.


Referee: [§3.2] DistillPO formulation: The structure-aligned advantage assignment is presented as localizing each reward component to its responsible output span, yet the manuscript provides no explicit mechanism or analysis showing how message-level validity gates prevent unfaithful rewrites that add plausible but non-entailed content. This assumption is load-bearing for the central claim that the distilled evidence remains faithful while improving downstream QA accuracy.

Authors: We appreciate the referee highlighting the need for greater explicitness here. The decomposed-and-gated reward pipeline in §3.2 applies a message-level validity gate before any rewriting step, with invalid selections receiving zero reward and being excluded from further processing. This is intended to block propagation of unfaithful content. However, we agree the current text does not provide sufficient formalization or supporting analysis of this gating behavior. In the revision we will expand §3.2 with pseudocode for the validity gate, a step-by-step description of how it interacts with the structure-aligned advantage assignment, and qualitative examples illustrating prevention of non-entailed additions.
Revision: yes

Referee: [Experimental results] LoCoMo and LongMemEval-S tables: The headline performance claims rest on benchmark wins, but the text supplies no quantitative ablations on individual reward components, no error analysis of hallucination or omission rates in the distilled outputs, and no comparison against non-RL distillation baselines. Without these, it is difficult to attribute gains specifically to the RL design rather than the retrieval stage.

Authors: We agree that these analyses are important for isolating the contribution of DistillPO. The present experiments emphasize end-to-end QA accuracy and efficiency, but do not include the requested breakdowns. In the revised manuscript we will add quantitative ablations that remove or ablate individual reward components, a dedicated error analysis reporting hallucination and omission rates via manual inspection of a representative sample of distilled outputs, and direct comparisons against non-RL distillation baselines (such as prompting-based selection and rewriting without reinforcement learning). These results will be placed in the experimental section or an appendix.
Revision: yes

Circularity Check

0 steps flagged

No significant circularity; results measured on external benchmarks


The paper introduces DeferMem and its DistillPO RL procedure as a proposed workflow (high-recall retrieval followed by query-time structured distillation via selection+rewriting), then reports empirical QA accuracy, runtime, and zero commercial token cost on the independent external benchmarks LoCoMo and LongMemEval-S. These headline metrics are obtained after training and are not algebraically or statistically forced by the decomposed rewards, validity gates, or structure-aligned advantage assignment; they constitute an independent evaluation. No equations, self-citations, or uniqueness theorems appear in the supplied text that would reduce the claimed performance to a re-labeling of the training inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted beyond the implicit modeling choice that RL with gated rewards will produce faithful evidence.

pith-pipeline@v0.9.0 · 5786 in / 1071 out tokens · 28032 ms · 2026-05-22T07:08:14.153690+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/RealityFromDistinction.lean reality_from_one_distinction unclear

Relation between the paper passage and the cited Recognition theorem: unclear.

DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting... decomposed-and-gated reward pipeline and structure-aligned advantage assignment

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 10 internal anchors

[1]

Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations

Nuo Chen, Hongguang Li, Jianhui Chang, Juhua Huang, Baoyuan Wang, and Jia Li. Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations. InProceedings of the 31st International Conference on Computational Linguistics, pages 755–773, Abu Dhabi, UAE, 2025. Association for Computational Linguistics. URL https: ...

work page 2025
[2]

Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory, 2025. URL https: //arxiv.org/abs/2504.19413

work page internal anchor Pith review Pith/arXiv arXiv 2025
[3]

Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z. Pan. Rethinking memory in LLM based agents: Representations, operations, and emerging topics, 2025. URLhttps://arxiv.org/abs/2505.00675

work page arXiv 2025
[4]

Pan, Yuxin Jiang, and Kam-Fai Wong

Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang XUE, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, and Kam-Fai Wong. Memory-t1: Reinforcement learning for temporal reasoning in multi-session agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openr...

work page 2026
[5]

From Local to Global: A Graph RAG Approach to Query-Focused Summarization

Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization, 2024. URL https: //arxiv.org/abs/2404.16130

work page internal anchor Pith review Pith/arXiv arXiv 2024
[6]

Lightmem: Lightweight and efficient memory-augmented generation

Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. Lightmem: Lightweight and efficient memory-augmented generation. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=dyJ0GWpjJB

work page 2026
[7]

ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms, 2025. URLhttps://arxiv.org/abs/2504.11536

work page internal anchor Pith review Pith/arXiv arXiv 2025
[8]

Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

work page doi:10.1038/s41586-025-09422-z 2025
[9]

LightRAG: Simple and fast retrieval-augmented generation

Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 10746–10761, Suzhou, China, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-emnlp.568. URL https://aclanthology.org/ 2025.finding...

work page doi:10.18653/v1/2025.findings-emnlp.568 2025
[10]

From RAG to memory: Non-parametric continual learning for large language models

Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From RAG to memory: Non-parametric continual learning for large language models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=LWH8yn4HS2

work page 2025
[11]

Memory in the age of AI agents,

Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

work page
[12]

URLhttps://arxiv.org/abs/2512.13564

work page internal anchor Pith review Pith/arXiv arXiv
[13]

Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen ...

work page arXiv 2026
[14]

WAGLE: Strategic weight attribution for effective and modular unlearning in large language models

Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, and Sijia Liu. WAGLE: Strategic weight attribution for effective and modular unlearning in large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,

work page
[15]

URLhttps://openreview.net/forum?id=VzOgnDJMgh

work page
[16]

The AI hippocampus: How far are we from human memory?Transactions on Machine Learning Research, 2025

Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, and Song-Chun Zhu. The AI hippocampus: How far are we from human memory?Transactions on Machine Learning Research, 2025. URLhttps://openreview.net/forum?id=Sk7pwmLuAY

work page 2025
[17]

Graph chain-of-thought: Augmenting large language models by reasoning on graphs

Bowen Jin, Chulin Xie, Jiawei Zhang, Kashob Kumar Roy, Yu Zhang, Zheng Li, Ruirui Li, Xianfeng Tang, Suhang Wang, Yu Meng, and Jiawei Han. Graph chain-of-thought: Augmenting large language models by reasoning on graphs. InFindings of the Association for Compu- tational Linguistics: ACL 2024, pages 163–184, Bangkok, Thailand, 2024. Association for Computat...

work page 2024
[18]

Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503.09516

work page internal anchor Pith review Pith/arXiv arXiv 2025
[19]

Memory OS of AI agent

Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory OS of AI agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25961– 25970, Suzhou, China, 2025. Association for Computational Linguistics. URL https://doi. org/10.18653/v1/2025.emnlp-main.1318. 11

work page doi:10.18653/v1/2025.emnlp-main.1318 2025
[20]

A human-inspired reading agent with gist memory of very long contexts

Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. A human-inspired reading agent with gist memory of very long contexts. InProceedings of the 41st Interna- tional Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 26396–26415. PMLR, 2024. URL https://proceedings.mlr.press/ v235/lee24c.html

work page 2024
[21]

Hello again! LLM-powered personalized agent for long-term dialogue

Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. Hello again! LLM-powered personalized agent for long-term dialogue. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5259–5276, Albuquerque, New M...

work page 2025
[22]

StructRAG: Boosting knowledge intensive reasoning of LLMs via inference-time hybrid information structurization

Zhuoqun Li, Xuanang Chen, Haiyang Yu, Hongyu Lin, Yaojie Lu, Qiaoyu Tang, Fei Huang, Xianpei Han, Le Sun, and Yongbin Li. StructRAG: Boosting knowledge intensive reasoning of LLMs via inference-time hybrid information structurization. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=GhexuBLxbO

work page 2025
[23]

and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024. URL https://doi.org/ 10.1162/tacl_a_00638

work page doi:10.1162/tacl_a_00638 2024
[24]

Evaluating very long-term conversational memory of LLM agents

Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, Bangkok, Thailand, 2024. Association for Computational Linguis...

work page 2024
[25]

Towards lifelong dialogue agents via timeline-based memory management

Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung-won Hwang, Dongha Lee, and Jinyoung Yeo. Towards lifelong dialogue agents via timeline-based memory management. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V...

work page 2025
[26]

URL https://aclanthology.org/2025

Association for Computational Linguistics. URL https://aclanthology.org/2025. naacl-long.435/

work page 2025
[27]

MemGPT: Towards LLMs as Operating Systems

Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. URL https: //arxiv.org/abs/2310.08560

work page internal anchor Pith review Pith/arXiv arXiv 2024
[28]

Vicky Zhao, Lili Qiu, and Dongmei Zhang

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981, Ban...

work page 2024
[29]

Vicky Zhao, Lili Qiu, and Jianfeng Gao

Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. Secom: On memory construction and retrieval for personalized conversational agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=xKDZAW0He3

work page 2025
[30]

Direct preference optimization: Your language model is secretly a reward model

Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9

work page 2023
[31]

From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs

Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs. InThe Thirteenth 12 International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=moXtEmCleY

work page 2025
[32]

Beyond pipelines: A survey of the paradigm shift toward model-native agentic ai, 2025

Jitao Sang, Jinlin Xiao, Jiarun Han, Jilin Chen, Xiaoyi Chen, Shuyu Wei, Yongjie Sun, and Yuhang Wang. Beyond pipelines: A survey of the paradigm shift toward model-native agentic ai, 2025. URLhttps://arxiv.org/abs/2510.16720

work page arXiv 2025
[33]

Proximal Policy Optimization Algorithms

John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

work page internal anchor Pith review Pith/arXiv arXiv 2017
[34]

Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/ 2402.03300

work page internal anchor Pith review Pith/arXiv arXiv 2024
[35]

REMem: Reasoning with episodic memory in language agent

Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, and Yu Su. REMem: Reasoning with episodic memory in language agent. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=fugnQxbvMm

work page 2026
[36]

Enhancing agentic rl with progressive reward shaping and value-based sampling policy optimization.arXiv preprint arXiv:2512.07478, 2025

Jianghao Su, Xia Zeng, Luhui Liu, Chao Luo, Ye Chen, and Zhuoran Zhuang. Enhancing agentic rl with progressive reward shaping and value-based sampling policy optimization, 2026. URLhttps://arxiv.org/abs/2512.07478

work page arXiv 2026
[37]

H-MEM: Hierarchical memory for high-efficiency long-term reasoning in LLM agents

Haoran Sun, Shaoning Zeng, and Bob Zhang. H-MEM: Hierarchical memory for high-efficiency long-term reasoning in LLM agents. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 341–350, Rabat, Morocco, 2026. Association for Computational Linguistics. URL https: //doi.o...

work page doi:10.18653/v1/2026.eacl-long.15 2026
[38]

In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents

Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Rajan Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, and Tomas Pfister. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. InProceedings of the 63rd Annual Meeting of the Association for C...

work page
[39]

URL https://aclanthology.org/2025

Association for Computational Linguistics. URL https://aclanthology.org/2025. acl-long.413/

work page 2025
[40]

TRL: Transformers Re- inforcement Learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Re- inforcement Learning. https://github.com/huggingface/trl, 2020. Software library, Apache-2.0 license

work page 2020
[41]

Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025

Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, and Liang Ding. Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025. URLhttps://doi.org/10.1016/j.neucom.2025.130193

work page doi:10.1016/j.neucom.2025.130193 2025
[42]

Beyond the limits: A survey of techniques to extend the context length in large language models

Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Ar- maghan Eshaghi. Beyond the limits: A survey of techniques to extend the context length in large language models. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 8299–8307. International Joint Conferences on Artifi...

work page doi:10.24963/ijcai.2024/917 2024
[43] Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum...
[44] Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. LongMemEval: Benchmarking chat assistants on long-term interactive memory. In The Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum?id=pZiyCaVuti
[45] Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, and Yong Liu. From human memory to AI memory: A survey on memory mechanisms in the era of LLMs, 2025. URL https://arxiv.org/abs/2504.15965
[46] Xuan Xie, Xuan Wang, and Wenjie Wang. DaGRPO: Rectifying gradient conflict in reasoning via distinctiveness-aware group relative policy optimization. In Logical and Symbolic Reasoning in Language Models @ AAAI 2026, 2026. URL https://openreview.net/forum?id=SucCwKlD9k
[47] Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, Wenlin Zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, and Tong Xu. From single to multi-granularity: Toward long-term memory association and selection of conversational agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/...
[48] Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. In The Twelfth International Conference on Learning Representations, 2024. URL https://openreview.net/forum?id=mlJLVigNHp
[49] Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=FiM0M8gcct
[50] B. Y. Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, and Zheng Liu. General agentic memory via deep research, 2025. URL https://arxiv.org/abs/2511.18423
[51] Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning, 2026. URL https://arxiv.org/abs/2508.19828
[52] Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao...
[53] Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=4OsgYD7em5
[54] Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinforcement learning for LLMs: A survey. Transactions on Machine Learning Research, 2026. URL https://openreview.net/forum?id=RY19y2RI1O
[56] Kai Zhang, Xinyuan Zhang, Ejaz Ahmed, Hongda Jiang, Caleb Kumar, Kai Sun, Zhaojiang Lin, Sanat Sharma, Shereen Oraby, Aaron Colak, Ahmed A Aly, Anuj Kumar, Xiaozhong Liu, and Xin Luna Dong. Assomem: Scalable memory QA with multi-signal associative retrieval. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openrevie...
[57] Yujie Zhang, Weikang Yuan, and Zhuoren Jiang. Bridging intuitive associations and deliberate recall: Empowering LLM personal assistant with graph-structured long-term memory. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17533–17547, Vienna, Austria, 2025. Association for Computational Linguistics. URL https://aclantholo...
[58] Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):155:1–155:47, 2025. URL https://doi.org/10.1145/3748302
[59] Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, pages 19724–19731. AAAI Press, 2024. URL https://doi.org/10.1609/aaai.v38i17.29946

A Datasets and Baseline Methods

A.1 Datasets. We evaluate lo...
Read the question and the conversation history. Consider each message one by one.

Add a message’s "msg_id" to "useful_msg" ONLY if that message is actually useful for answering the question.

For every "msg_id" in "useful_msg", add exactly one entry to "distilled_info":
- "msg_id": the same id
- "info": a single self-contained statement (or a compact set of statements). Includes all information from the target message (i.e., the message of the same msg_id) that is useful for answering the question.

Make "info" self-contained:
- Conduct reference resolution (pronouns, ellipsis, named entities) when the referent is unambiguous in its surrounding context.
- Interpret "I/we/my" from the perspective of the message’s speaker and interpret "you/your" as the conversational counterpart, unless the context indicates reported speech.

Each "info" entry must be grounded primarily in the message of the same msg_id, plus minimal preceding discourse context when necessary.
- You may use nearby preceding messages in the same segment for two limited purposes:
  (a) Reference resolution: resolve pronouns/ellipsis when unambiguous.
  (b) Discourse-context restoration: recover the minimal preceding...

Do NOT include meta commentary (e.g., "this message is useful because...") in "info".

If the conversation history contains no information useful for answering the question, output: {{ "useful_msg": [], "distilled_info": [] }}

Distiller user prompt

[CONVERSATION HISTORY FORMAT]
- The conversation history is a list of conversation segments.
- Each segment is a list of messages.
- Each message has the following fields:
  - ‘msg_id‘: the id of t...
Info_extracted: a list of {msg_id, info}

Original_Segs: a list of conversation segments containing the original messages. Data format of Original_Segs:
- The conversation history is a list of conversation segments.
- Each segment is a list of messages.
- Each message has the following fields:
  - ‘msg_id‘: the id of the message.
  - ‘speaker‘ (optional): who said this message (e.g., "user", "assista...

resolve references/pronouns in the TARGET MESSAGE or in the info

recover the minimal conversational context needed to interpret what the TARGET MESSAGE is responding to.
- Speaker viewpoint rule:
  - Pronouns like "I/we/my" are from the perspective of the TARGET MESSAGE’s speaker.
  - "you/your" refers to the conversational counterpart unless context indicates otherwise.
  - For reported speech/quotations, resolve pronouns b...
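The distiller prompt rules describe an output with two fields, "useful_msg" and "distilled_info", holding exactly one "info" entry per selected "msg_id". As an illustration only, the following minimal Python sketch checks those structural constraints on a raw model output. The function name `validate_distiller_output`, the assumption that message ids are strings, and the assumption that the output is plain JSON (with the doubled braces in the prompt being template escaping) are hypothetical and are not taken from the paper's implementation.

```python
import json


def validate_distiller_output(raw: str, candidate_ids: set[str]) -> list[str]:
    """Return a list of format violations; an empty list means well-formed.

    Hypothetical helper: it mirrors the prompt rules, not the authors' code.
    """
    try:
        out = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(out, dict):
        return ["top-level value must be an object"]

    useful = out.get("useful_msg")
    distilled = out.get("distilled_info")
    if not isinstance(useful, list) or not isinstance(distilled, list):
        return ['"useful_msg" and "distilled_info" must both be lists']

    errors = []
    useful_ids = []
    for mid in useful:
        # Assumption: ids are strings and must come from the retrieved candidates.
        if not isinstance(mid, str) or mid not in candidate_ids:
            errors.append(f"unknown msg_id in useful_msg: {mid!r}")
        if isinstance(mid, str):
            useful_ids.append(mid)

    seen = []
    for entry in distilled:
        if not isinstance(entry, dict) or not isinstance(entry.get("msg_id"), str):
            errors.append("each distilled_info entry needs a string msg_id")
            continue
        info = entry.get("info")
        if not isinstance(info, str) or not info.strip():
            errors.append(f"empty info for {entry['msg_id']!r}")
        seen.append(entry["msg_id"])

    # Exactly one distilled entry per selected message, and no extras.
    if sorted(seen) != sorted(useful_ids):
        errors.append("distilled_info must have exactly one entry per useful_msg id")
    return errors


if __name__ == "__main__":
    empty = '{"useful_msg": [], "distilled_info": []}'
    print(validate_distiller_output(empty, {"m1"}))
```

A check of this kind only covers the structure of the output. Whether each "info" statement is faithful to its source message is a separate question that a structural validator cannot answer.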

[39] [39]

URL https://aclanthology.org/2025

Association for Computational Linguistics. URL https://aclanthology.org/2025. acl-long.413/

work page 2025

[40] [40]

TRL: Transformers Re- inforcement Learning

Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Re- inforcement Learning. https://github.com/huggingface/trl, 2020. Software library, Apache-2.0 license

work page 2020

[41] [41]

Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025

Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, and Liang Ding. Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025. URLhttps://doi.org/10.1016/j.neucom.2025.130193

work page doi:10.1016/j.neucom.2025.130193 2025

[42] [42]

Beyond the limits: A survey of techniques to extend the context length in large language models

Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Ar- maghan Eshaghi. Beyond the limits: A survey of techniques to extend the context length in large language models. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 8299–8307. International Joint Conferences on Artifi...

work page doi:10.24963/ijcai.2024/917 2024

[43] [43]

Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs

Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum...

work page 2026

[44] [44]

Long- memeval: Benchmarking chat assistants on long-term interactive memory

Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=pZiyCaVuti. 13

work page 2025

[45] [45]

From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, and Yong Liu. From human memory to ai memory: A survey on memory mechanisms in the era of LLMs, 2025. URLhttps://arxiv.org/abs/2504.15965

work page internal anchor Pith review Pith/arXiv arXiv 2025

[46] [46]

DaGRPO: Rectifying gradient conflict in reasoning via distinctiveness-aware group relative policy optimization

Xuan Xie, Xuan Wang, and Wenjie Wang. DaGRPO: Rectifying gradient conflict in reasoning via distinctiveness-aware group relative policy optimization. InLogical and Symbolic Reasoning in Language Models @ AAAI 2026, 2026. URL https://openreview.net/forum?id= SucCwKlD9k

work page 2026

[47] [47]

From single to multi- granularity: Toward long-term memory association and selection of conversational agents

Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, Wenlin Zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, and Tong Xu. From single to multi- granularity: Toward long-term memory association and selection of conversational agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/...

work page 2026

[48] [48]

RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation

Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. InThe Twelfth International Con- ference on Learning Representations, 2024. URL https://openreview.net/forum?id= mlJLVigNHp

work page 2024

[49] [49]

A-mem: Agentic memory for LLM agents

Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=FiM0M8gcct

work page 2026

[50] [50]

B. Y . Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, and Zheng Liu. General agentic memory via deep research, 2025. URLhttps://arxiv.org/abs/2511.18423

work page arXiv 2025

[51] [51]

Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, V olker Tresp, and Yunpu Ma. Memory-r1: Enhancing large language model agents to manage and utilize memories via reinforcement learning, 2026. URLhttps://arxiv.org/abs/2508.19828

work page internal anchor Pith review Pith/arXiv arXiv 2026

[52] [52]

DAPO: An open-source LLM reinforcement learning system at scale

Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao...

work page 2026

[53] [53]

Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026

Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=4OsgYD7em5

work page 2026

[54] [54]

The landscape of agentic reinforcement learning for llms: A survey.Transactions on Machine Learning Research, 2026,

Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfo...

work page 2026

[55] [55]

URLhttps://openreview.net/forum?id=RY19y2RI1O

work page

[56] [56]

Assomem: Scalable memory QA with multi-signal associative retrieval

Kai Zhang, Xinyuan Zhang, Ejaz Ahmed, Hongda Jiang, Caleb Kumar, Kai Sun, Zhaojiang Lin, Sanat Sharma, Shereen Oraby, AARON COLAK, Ahmed A Aly, Anuj Kumar, Xiaozhong Liu, and Xin Luna Dong. Assomem: Scalable memory QA with multi-signal associative retrieval. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openrevie...

work page 2026

[57] [57]

Bridging intuitive associations and de- liberate recall: Empowering LLM personal assistant with graph-structured long-term mem- ory

Yujie Zhang, Weikang Yuan, and Zhuoren Jiang. Bridging intuitive associations and de- liberate recall: Empowering LLM personal assistant with graph-structured long-term mem- ory. InFindings of the Association for Computational Linguistics: ACL 2025, pages 17533–17547, Vienna, Austria, 2025. Association for Computational Linguistics. URL https://aclantholo...

work page 2025

[58]

Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):155:1–155:47, 2025. URL https://doi.org/10.1145/3748302

[59]

Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, pages 19724–19731. AAAI Press, 2024. URL https://doi.org/10.1609/aaai.v38i17.29946.

A Datasets and Baseline Methods

A.1 Datasets.

We evaluate lo...

[60]

Read the question and the conversation history. Consider each message one by one.

[61]

Add a message’s "msg_id" to "useful_msg" ONLY if that message is actually useful for answering the question.

[62]

For every "msg_id" in "useful_msg", add exactly one entry to "distilled_info":
- "msg_id": the same id
- "info": a single self-contained statement (or a compact set of statements). Includes all information from the target message (i.e., the message of the same msg_id) that is useful for answering the question.

[63]

Make "info" self-contained:
- Conduct reference resolution (pronouns, ellipsis, named entities) when the referent is unambiguous in its surrounding context.
- Interpret "I/we/my" from the perspective of the message’s speaker and interpret "you/your" as the conversational counterpart, unless the context indicates reported speech.

[64]

Each "info" entry must be grounded primarily in the message of the same msg_id, plus minimal preceding discourse context when necessary.
- You may use nearby preceding messages in the same segment for two limited purposes:
  (a) Reference resolution: resolve pronouns/ellipsis when unambiguous.
  (b) Discourse-context restoration: recover the minimal preceding...

[65]

Do NOT include meta commentary (e.g., "this message is useful because...") in "info".

[66]

If the conversation history contains no information useful for answering the question, output:
{{ "useful_msg": [], "distilled_info": [] }}

Distiller user prompt

[CONVERSATION HISTORY FORMAT]
- The conversation history is a list of conversation segments.
- Each segment is a list of messages.
- Each message has the following fields:
  - `msg_id`: the id of t...
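The output contract in the system prompt rules above (a "useful_msg" id list, exactly one {"msg_id", "info"} entry per selected id, and two empty lists when nothing is useful) lends itself to a mechanical format check. The Python sketch below is an illustration only, not the paper's implementation: the function name `check_distiller_output`, the `candidate_ids` argument, and the assumption that ids are strings or integers are all hypothetical, and the doubled braces in the prompt are assumed to be template escaping for ordinary JSON braces.

```python
import json
from collections import Counter


def check_distiller_output(raw, candidate_ids):
    """Return a list of format problems; an empty list means well-formed.

    raw: the distiller's text output, expected to be one JSON object.
    candidate_ids: ids of the messages shown to the distiller
                   (hypothetical argument; ids assumed to be str or int).
    """
    try:
        out = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"not valid JSON: {exc.msg}"]
    if not isinstance(out, dict):
        return ["top-level value must be a JSON object"]

    useful = out.get("useful_msg")
    distilled = out.get("distilled_info")
    if not isinstance(useful, list) or not isinstance(distilled, list):
        return ['"useful_msg" and "distilled_info" must both be lists']

    problems = []
    # Selected ids must refer to messages that were actually provided.
    for mid in useful:
        if not isinstance(mid, (str, int)) or mid not in candidate_ids:
            problems.append(f"unknown msg_id in useful_msg: {mid!r}")

    # Each entry needs exactly the two fields named in the prompt.
    entry_ids = []
    for entry in distilled:
        if not isinstance(entry, dict) or set(entry) != {"msg_id", "info"}:
            problems.append(f"malformed distilled_info entry: {entry!r}")
            continue
        if not isinstance(entry["msg_id"], (str, int)):
            problems.append(f"non-scalar msg_id in entry: {entry!r}")
            continue
        if not isinstance(entry["info"], str) or not entry["info"].strip():
            problems.append(f"empty info for msg_id {entry['msg_id']!r}")
        entry_ids.append(entry["msg_id"])

    # Exactly one distilled entry per selected id, and none for other ids.
    counts = Counter(entry_ids)
    for mid in useful:
        if isinstance(mid, (str, int)) and counts[mid] != 1:
            problems.append(f"expected one entry for {mid!r}, found {counts[mid]}")
    for mid in counts:
        if mid not in useful:
            problems.append(f"entry for unselected msg_id: {mid!r}")
    return problems
```

A check of this kind could act as a cheap validity filter that runs before any quality judgment on the distilled text. It only inspects structure; whether each "info" is faithful to its source message needs a separate check.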


[67]

Info_extracted: a list of {msg_id, info}

[68]

Original_Segs: a list of conversation segments containing the original messages.

Data format of Original_Segs:
- The conversation history is a list of conversation segments.
- Each segment is a list of messages.
- Each message has the following fields:
  - `msg_id`: the id of the message.
  - `speaker` (optional): who said this message (e.g., "user", "assista...

[69]

resolve references/pronouns in the TARGET MESSAGE or in the info

[70]

recover the minimal conversational context needed to interpret what the TARGET MESSAGE is responding to.
- Speaker viewpoint rule:
  - Pronouns like "I/we/my" are from the perspective of the TARGET MESSAGE’s speaker.
  - "you/your" refers to the conversational counterpart unless context indicates otherwise.
  - For reported speech/quotations, resolve pronouns b...