arxiv: 2605.22411 · v1 · pith:6RDIGSSGnew · submitted 2026-05-21 · 💻 cs.CL · cs.AI · cs.LG

DeferMem: Query-Time Evidence Distillation via Reinforcement Learning for Long-Term Memory QA

Pith reviewed 2026-05-22 07:08 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords long-term memory QA · evidence distillation · reinforcement learning · LLM agents · memory systems · query-time processing · conversation history

The pith

DeferMem distills query-specific evidence from long histories at query time using reinforcement learning to boost QA accuracy and efficiency.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models struggle when answers depend on evidence scattered across long conversation histories filled with irrelevant content. DeferMem decouples the task into broad candidate retrieval at query time followed by a learned distillation step that selects and rewrites messages into clean, self-contained evidence. It trains this distiller with DistillPO, a reinforcement learning method that breaks the distillation action into selection and rewriting steps and optimizes them with gated rewards that check validity before quality. This query-conditioned approach is meant to reduce the noise that downstream answerers would otherwise have to filter out themselves.
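The two-stage flow described above can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration: the names `Message`, `retrieve_candidates`, and `placeholder_distiller` are hypothetical, retrieval is a toy word-overlap ranking rather than the paper's segment-link retriever, and the distiller is a fixed rule standing in for the RL-trained policy (it selects messages but does not rewrite them).

```python
from dataclasses import dataclass
from typing import List, Set, Tuple


@dataclass(frozen=True)
class Message:
    msg_id: int
    text: str


def tokens(text: str) -> Set[str]:
    """Lowercased words with trailing punctuation stripped."""
    return {w.strip(".,?!").lower() for w in text.split()}


def retrieve_candidates(history: List[Message], query: str, top_k: int) -> List[Message]:
    """Stage 1 (toy): broad retrieval by word overlap with the query."""
    q = tokens(query)
    ranked = sorted(history, key=lambda m: len(q & tokens(m.text)), reverse=True)
    return ranked[:top_k]


def placeholder_distiller(
    candidates: List[Message], query: str, min_overlap: int = 3
) -> Tuple[List[int], List[str]]:
    """Stage 2 (stand-in): keep candidates that share enough words with the query.

    A trained distiller would also rewrite each selected message into
    self-contained evidence; this placeholder only tags it with its id.
    """
    q = tokens(query)
    selected = [m for m in candidates if len(q & tokens(m.text)) >= min_overlap]
    ids = [m.msg_id for m in selected]
    evidence = [f"[msg {m.msg_id}] {m.text}" for m in selected]
    return ids, evidence


history = [
    Message(0, "Alice adopted a dog named Biscuit in March."),
    Message(1, "The weather was rainy all week."),
    Message(2, "Bob asked about the dog and Alice said Biscuit is a beagle."),
    Message(3, "Alice is planning a trip to Lisbon."),
]
query = "What breed is the dog Alice adopted?"

candidates = retrieve_candidates(history, query, top_k=3)
selected_ids, evidence = placeholder_distiller(candidates, query)
```

In this toy run the retriever returns three candidates, one of them off-topic, and the second stage narrows them to the two messages about the dog. The raw history is never modified, which mirrors the idea of deferring memory processing until the query is known.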

Core claim

The paper claims that post-retrieval evidence distillation can be cast as a structured RL action of message selection plus rewriting, optimized through a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, so that high-recall but noisy candidates are turned into faithful, query-specific evidence without requiring pre-query memory processing.

What carries the argument

DistillPO, the reinforcement learning algorithm that formulates evidence distillation as a structured action of selecting and rewriting retrieved messages, then optimizes it with decomposed rewards that gate from validity checks to quality checks and assign advantages to responsible output spans.
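A gated reward of this kind might look like the sketch below. It is an illustration under stated assumptions, not the paper's reward: the function name `gated_reward`, the specific validity checks, the weights `w_task` and `w_quality`, and the `brevity` quality score are all hypothetical. The only behavior taken from the description is that validity is checked first, that invalid actions earn nothing, and that task-level correctness and quality are scored only after the gate passes.

```python
from typing import Callable, Dict, List, Set


def gated_reward(
    selected_ids: List[int],
    evidence: List[str],
    candidate_ids: Set[int],
    task_correct: bool,
    quality_fn: Callable[[List[str]], float],
    w_task: float = 1.0,     # hypothetical weight
    w_quality: float = 0.5,  # hypothetical weight
) -> Dict[str, float]:
    """Score one distillation action; validity gates every other component."""
    reward = {"validity": 0.0, "task": 0.0, "quality": 0.0, "total": 0.0}

    # Gate: selections must be non-empty, paired one-to-one with evidence
    # entries, and drawn only from the retrieved candidates.
    is_valid = (
        len(selected_ids) > 0
        and len(selected_ids) == len(evidence)
        and set(selected_ids) <= candidate_ids
    )
    if not is_valid:
        return reward  # invalid actions receive zero reward

    reward["validity"] = 1.0
    reward["task"] = 1.0 if task_correct else 0.0
    reward["quality"] = quality_fn(evidence)
    reward["total"] = w_task * reward["task"] + w_quality * reward["quality"]
    return reward


def brevity(evidence: List[str]) -> float:
    """Toy quality score: fraction of evidence entries of at most 20 words."""
    return sum(1 for e in evidence if len(e.split()) <= 20) / len(evidence)


candidate_ids = {3, 7, 9}
ok = gated_reward(
    [3, 7],
    ["Alice adopted Biscuit in March.", "Biscuit is a beagle."],
    candidate_ids,
    task_correct=True,
    quality_fn=brevity,
)
bad = gated_reward([3, 42], ["...", "..."], candidate_ids, True, brevity)
```

Here `bad` selects message 42, which was never retrieved, so it scores zero even though the downstream answer was marked correct. Returning the components separately, rather than only the total, is what would allow each one to be credited to a different part of the output.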

If this is right

  • On LoCoMo and LongMemEval-S it reaches the highest QA accuracy among the tested systems.
  • It delivers the fastest runtime for memory operations compared with strong baselines.
  • It incurs zero commercial-API token cost for all memory-related steps.
  • It improves both answer accuracy and overall memory-system efficiency over pre-processed memory approaches.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same query-time selection-plus-rewriting pattern could be tested on other retrieval-augmented tasks where the query arrives after the full context is stored.
  • If the decomposed reward design generalizes, it might reduce the need for expensive pre-computation of memory summaries in long-running agent deployments.
  • The approach suggests a way to keep raw history intact while still producing compact evidence, which could help systems handle histories that grow without bound.

Load-bearing premise

That casting post-retrieval distillation as a structured RL action of selection and rewriting, with decomposed rewards, will reliably produce faithful query-conditioned outputs without introducing new hallucinations or omissions.

What would settle it

A set of long conversational test cases in which the distilled evidence either drops a key supporting fact present in the raw history or adds a fabricated detail, causing the final QA answer to be incorrect even when retrieval recall was high.

Figures

Figures reproduced from arXiv: 2605.22411 by Jianing Yin, Tan Tang.

Figure 1: Comparison between existing memory systems (left) and DeferMem (right).
Figure 2: DeferMem framework: (1) a segment-link retriever produces high-recall candidates, and ...
Figure 3: Distribution of the number of distilled evidence entries returned by DeferMem.
Original abstract

Large language model (LLM) agents still struggle with long-term memory question answering, where answer-supporting evidence is often scattered across long conversational histories and buried in substantial irrelevant content. Existing memory systems typically process memory before future queries are known, then retrieve the resulting units based on similarity rather than their utility for answering the query. This workflow leaves downstream answerers to denoise retrieved candidates and reconstruct query-specific evidence. We present DeferMem, a long-term memory framework that decouples this problem into high-recall candidate retrieval and query-conditioned evidence distillation. DeferMem uses a lightweight segment-link structure to organize raw history and retrieve broad candidates at query time. It then applies a memory distiller trained with DistillPO, our reinforcement learning algorithm for distilling the high-recall but highly noisy candidates into a set of faithful, self-contained, and query-conditioned evidence. DistillPO formulates post-retrieval evidence distillation as a structured action comprising message selection and evidence rewriting. It optimizes this action with a decomposed-and-gated reward pipeline and structure-aligned advantage assignment, gating reward components from validity to quality checks while exposing task-level correctness feedback early and assigning each reward to its responsible output span. On LoCoMo and LongMemEval-S, DeferMem surpasses strong baselines in QA accuracy and memory-system efficiency, achieving the highest QA accuracy with the fastest runtime and zero commercial-API token cost for memory operations.
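The idea of assigning each reward to its responsible output span can be sketched as follows. This is a guess at the general shape, not the paper's formulation: it assumes a group-relative normalization over several sampled outputs for the same query (in the style of GRPO), assumes that a selection reward is credited to selection tokens and a rewriting reward to rewriting tokens, and uses hypothetical field names (`selection_reward`, `rewrite_reward`, `span_labels`).

```python
from statistics import mean, pstdev
from typing import Dict, List


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Normalize rewards within a group of rollouts sampled for one query."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def span_aligned_advantages(rollouts: List[Dict]) -> List[List[float]]:
    """Give every token the advantage of the reward its span is responsible for."""
    sel_adv = group_advantages([r["selection_reward"] for r in rollouts])
    rew_adv = group_advantages([r["rewrite_reward"] for r in rollouts])

    per_token = []
    for rollout, a_sel, a_rew in zip(rollouts, sel_adv, rew_adv):
        per_token.append(
            [a_sel if label == "select" else a_rew for label in rollout["span_labels"]]
        )
    return per_token


# Two rollouts for one query: they differ in selection quality only.
rollouts = [
    {"selection_reward": 1.0, "rewrite_reward": 0.0,
     "span_labels": ["select", "select", "rewrite"]},
    {"selection_reward": 0.0, "rewrite_reward": 0.0,
     "span_labels": ["select", "rewrite", "rewrite"]},
]
advantages = span_aligned_advantages(rollouts)
```

In this example the two rollouts earn the same rewriting reward, so their rewriting tokens get zero advantage, while the selection tokens are pushed up in the first rollout and down in the second. A single sequence-level advantage would instead have spread the selection signal over the rewriting tokens as well.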

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents DeferMem, a long-term memory QA framework that decouples high-recall candidate retrieval (via a lightweight segment-link structure over conversational history) from query-conditioned evidence distillation. The distillation step is performed by a policy trained with DistillPO, which formulates the task as a structured RL action (message selection plus rewriting) optimized via a decomposed-and-gated reward pipeline and structure-aligned advantage assignment. The paper reports that DeferMem achieves the highest QA accuracy on LoCoMo and LongMemEval-S while also delivering the fastest runtime and zero commercial-API token cost for memory operations.

Significance. If the DistillPO training reliably yields faithful, query-specific evidence without introducing hallucinations or omissions, the query-time distillation approach could meaningfully improve efficiency and accuracy for LLM agents operating over long histories, by avoiding both pre-query compression and post-retrieval denoising.

major comments (2)
  1. [§3.2] §3.2 (DistillPO formulation): The structure-aligned advantage assignment is presented as localizing each reward component to its responsible output span, yet the manuscript provides no explicit mechanism or analysis showing how message-level validity gates prevent unfaithful rewrites that add plausible but non-entailed content. This assumption is load-bearing for the central claim that the distilled evidence remains faithful while improving downstream QA accuracy.
  2. [Experimental results] Experimental results (LoCoMo and LongMemEval-S tables): The headline performance claims rest on benchmark wins, but the text supplies no quantitative ablations on individual reward components, no error analysis of hallucination or omission rates in the distilled outputs, and no comparison against non-RL distillation baselines. Without these, it is difficult to attribute gains specifically to the RL design rather than the retrieval stage.
minor comments (2)
  1. [Abstract] The abstract states 'zero commercial-API token cost for memory operations' but does not clarify whether the distiller itself incurs any external API usage during training or inference.
  2. [§3.2] Notation for the decomposed reward terms (validity, quality, task correctness) is introduced without a consolidated table or equation summarizing their gating logic and weighting.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and commit to revisions that will improve the clarity and evidential support for our claims.

Point-by-point responses
  1. Referee: [§3.2] §3.2 (DistillPO formulation): The structure-aligned advantage assignment is presented as localizing each reward component to its responsible output span, yet the manuscript provides no explicit mechanism or analysis showing how message-level validity gates prevent unfaithful rewrites that add plausible but non-entailed content. This assumption is load-bearing for the central claim that the distilled evidence remains faithful while improving downstream QA accuracy.

    Authors: We appreciate the referee highlighting the need for greater explicitness here. The decomposed-and-gated reward pipeline in §3.2 applies a message-level validity gate before any rewriting step, with invalid selections receiving zero reward and being excluded from further processing. This is intended to block propagation of unfaithful content. However, we agree the current text does not provide sufficient formalization or supporting analysis of this gating behavior. In the revision we will expand §3.2 with pseudocode for the validity gate, a step-by-step description of how it interacts with the structure-aligned advantage assignment, and qualitative examples illustrating prevention of non-entailed additions. revision: yes

  2. Referee: [Experimental results] Experimental results (LoCoMo and LongMemEval-S tables): The headline performance claims rest on benchmark wins, but the text supplies no quantitative ablations on individual reward components, no error analysis of hallucination or omission rates in the distilled outputs, and no comparison against non-RL distillation baselines. Without these, it is difficult to attribute gains specifically to the RL design rather than the retrieval stage.

    Authors: We agree that these analyses are important for isolating the contribution of DistillPO. The present experiments emphasize end-to-end QA accuracy and efficiency, but do not include the requested breakdowns. In the revised manuscript we will add quantitative ablations that remove individual reward components, a dedicated error analysis reporting hallucination and omission rates via manual inspection of a representative sample of distilled outputs, and direct comparisons against non-RL distillation baselines (such as prompting-based selection and rewriting without reinforcement learning). These results will be placed in the experimental section or an appendix. revision: yes

Circularity Check

0 steps flagged

No significant circularity; results measured on external benchmarks

The paper introduces DeferMem and its DistillPO RL procedure as a proposed workflow (high-recall retrieval followed by query-time structured distillation via selection+rewriting), then reports empirical QA accuracy, runtime, and zero commercial token cost on the independent external benchmarks LoCoMo and LongMemEval-S. These headline metrics are obtained after training and are not algebraically or statistically forced by the decomposed rewards, validity gates, or structure-aligned advantage assignment; they constitute an independent evaluation. No equations, self-citations, or uniqueness theorems appear in the supplied text that would reduce the claimed performance to a re-labeling of the training inputs. The derivation chain is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review; no explicit free parameters, axioms, or invented entities can be extracted beyond the implicit modeling choice that RL with gated rewards will produce faithful evidence.

Reference graph

Works this paper leans on

70 extracted references · 70 canonical work pages · 10 internal anchors

  1. [1]

    Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations

    Nuo Chen, Hongguang Li, Jianhui Chang, Juhua Huang, Baoyuan Wang, and Jia Li. Compress to impress: Unleashing the potential of compressive memory in real-world long-term conversations. InProceedings of the 31st International Conference on Computational Linguistics, pages 755–773, Abu Dhabi, UAE, 2025. Association for Computational Linguistics. URL https: ...

  2. [2]

    Mem0: Building Production-Ready AI Agents with Scalable Long-Term Memory

    Prateek Chhikara, Dev Khant, Saket Aryan, Taranjeet Singh, and Deshraj Yadav. Mem0: Building production-ready ai agents with scalable long-term memory, 2025. URL https: //arxiv.org/abs/2504.19413

  3. [3]

    Yiming Du, Wenyu Huang, Danna Zheng, Zhaowei Wang, Sebastien Montella, Mirella Lapata, Kam-Fai Wong, and Jeff Z. Pan. Rethinking memory in LLM based agents: Representations, operations, and emerging topics, 2025. URLhttps://arxiv.org/abs/2505.00675

  4. [4]

    Pan, Yuxin Jiang, and Kam-Fai Wong

    Yiming Du, Baojun Wang, Yifan Xiang, Zhaowei Wang, Wenyu Huang, Boyang XUE, Bin Liang, Xingshan Zeng, Fei Mi, Haoli Bai, Lifeng Shang, Jeff Z. Pan, Yuxin Jiang, and Kam-Fai Wong. Memory-t1: Reinforcement learning for temporal reasoning in multi-session agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openr...

  5. [5]

    From Local to Global: A Graph RAG Approach to Query-Focused Summarization

    Darren Edge, Ha Trinh, Newman Cheng, Joshua Bradley, Alex Chao, Apurva Mody, Steven Truitt, Dasha Metropolitansky, Robert Osazuwa Ness, and Jonathan Larson. From local to global: A graph RAG approach to query-focused summarization, 2024. URL https: //arxiv.org/abs/2404.16130

  6. [6]

    Lightmem: Lightweight and efficient memory-augmented generation

    Jizhan Fang, Xinle Deng, Haoming Xu, Ziyan Jiang, Yuqi Tang, Ziwen Xu, Shumin Deng, Yunzhi Yao, Mengru Wang, Shuofei Qiao, Huajun Chen, and Ningyu Zhang. Lightmem: Lightweight and efficient memory-augmented generation. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum? id=dyJ0GWpjJB

  7. [7]

    ReTool: Reinforcement Learning for Strategic Tool Use in LLMs

    Jiazhan Feng, Shijue Huang, Xingwei Qu, Ge Zhang, Yujia Qin, Baoquan Zhong, Chengquan Jiang, Jinxin Chi, and Wanjun Zhong. Retool: Reinforcement learning for strategic tool use in llms, 2025. URLhttps://arxiv.org/abs/2504.11536

  8. [8]

    Daya Guo, Dejian Yang, Haowei Zhang, Junxiao Song, Peiyi Wang, Qihao Zhu, Runxin Xu, Ruoyu Zhang, Shirong Ma, Xiao Bi, Xiaokang Zhang, Xingkai Yu, Yu Wu, Z. F. Wu, Zhibin Gou, Zhihong Shao, Zhuoshu Li, Ziyi Gao, Aixin Liu, Bing Xue, Bingxuan Wang, Bochao Wu, Bei Feng, Chengda Lu, Chenggang Zhao, Chengqi Deng, Chong Ruan, Damai Dai, Deli Chen, Dongjie Ji, ...

  9. [9]

    LightRAG: Simple and fast retrieval-augmented generation

    Zirui Guo, Lianghao Xia, Yanhua Yu, Tu Ao, and Chao Huang. LightRAG: Simple and fast retrieval-augmented generation. InFindings of the Association for Computational Linguistics: EMNLP 2025, pages 10746–10761, Suzhou, China, 2025. Association for Computational Linguistics. doi: 10.18653/v1/2025.findings-emnlp.568. URL https://aclanthology.org/ 2025.finding...

  10. [10]

    From RAG to memory: Non-parametric continual learning for large language models

    Bernal Jiménez Gutiérrez, Yiheng Shu, Weijian Qi, Sizhe Zhou, and Yu Su. From RAG to memory: Non-parametric continual learning for large language models. InForty-second International Conference on Machine Learning, 2025. URL https://openreview.net/ forum?id=LWH8yn4HS2

  11. [11]

    Memory in the age of AI agents,

    Yuyang Hu, Shichun Liu, Yanwei Yue, Guibin Zhang, Boyang Liu, Fangyi Zhu, Jiahang Lin, Honglin Guo, Shihan Dou, Zhiheng Xi, Senjie Jin, Jiejun Tan, Yanbin Yin, Jiongnan Liu, Zeyu Zhang, Zhongxiang Sun, Yutao Zhu, Hao Sun, Boci Peng, Zhenrong Cheng, Xuanbo Fan, Jiaxin Guo, Xinlei Yu, Zhenhong Zhou, Zewen Hu, Jiahao Huo, Junhao Wang, Yuwei Niu, Yu Wang, Zhe...

  12. [12]

    URLhttps://arxiv.org/abs/2512.13564

  13. [13]

    Wei-Chieh Huang, Weizhi Zhang, Yueqing Liang, Yuanchen Bei, Yankai Chen, Tao Feng, Xinyu Pan, Zhen Tan, Yu Wang, Tianxin Wei, Shanglin Wu, Ruiyao Xu, Liangwei Yang, Rui Yang, Wooseong Yang, Chin-Yuan Yeh, Hanrong Zhang, Haozhen Zhang, Siqi Zhu, Henry Peng Zou, Wanjia Zhao, Song Wang, Wujiang Xu, Zixuan Ke, Zheng Hui, Dawei Li, Yaozu Wu, Langzhou He, Chen ...

  14. [14]

    WAGLE: Strategic weight attribution for effective and modular unlearning in large language models

    Jinghan Jia, Jiancheng Liu, Yihua Zhang, Parikshit Ram, Nathalie Baracaldo, and Sijia Liu. WAGLE: Strategic weight attribution for effective and modular unlearning in large language models. InThe Thirty-eighth Annual Conference on Neural Information Processing Systems,

  15. [15]

    URLhttps://openreview.net/forum?id=VzOgnDJMgh

  16. [16]

    The AI hippocampus: How far are we from human memory?Transactions on Machine Learning Research, 2025

    Zixia Jia, Jiaqi Li, Yipeng Kang, Yuxuan Wang, Tong Wu, Quansen Wang, Xiaobo Wang, Shuyi Zhang, Junzhe Shen, Qing Li, Siyuan Qi, Yitao Liang, Di He, Zilong Zheng, and Song-Chun Zhu. The AI hippocampus: How far are we from human memory?Transactions on Machine Learning Research, 2025. URLhttps://openreview.net/forum?id=Sk7pwmLuAY

  17. [17]

    Graph chain-of-thought: Augmenting large language models by reasoning on graphs

    Bowen Jin, Chulin Xie, Jiawei Zhang, Kashob Kumar Roy, Yu Zhang, Zheng Li, Ruirui Li, Xianfeng Tang, Suhang Wang, Yu Meng, and Jiawei Han. Graph chain-of-thought: Augmenting large language models by reasoning on graphs. InFindings of the Association for Compu- tational Linguistics: ACL 2024, pages 163–184, Bangkok, Thailand, 2024. Association for Computat...

  18. [18]

    Search-R1: Training LLMs to Reason and Leverage Search Engines with Reinforcement Learning

    Bowen Jin, Hansi Zeng, Zhenrui Yue, Jinsung Yoon, Sercan Arik, Dong Wang, Hamed Za- mani, and Jiawei Han. Search-r1: Training llms to reason and leverage search engines with reinforcement learning, 2025. URLhttps://arxiv.org/abs/2503.09516

  19. [19]

    Memory OS of AI agent

    Jiazheng Kang, Mingming Ji, Zhe Zhao, and Ting Bai. Memory OS of AI agent. InProceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 25961– 25970, Suzhou, China, 2025. Association for Computational Linguistics. URL https://doi. org/10.18653/v1/2025.emnlp-main.1318. 11

  20. [20]

    A human-inspired reading agent with gist memory of very long contexts

    Kuang-Huei Lee, Xinyun Chen, Hiroki Furuta, John Canny, and Ian Fischer. A human-inspired reading agent with gist memory of very long contexts. InProceedings of the 41st Interna- tional Conference on Machine Learning, volume 235 ofProceedings of Machine Learning Research, pages 26396–26415. PMLR, 2024. URL https://proceedings.mlr.press/ v235/lee24c.html

  21. [21]

    Hello again! LLM-powered personalized agent for long-term dialogue

    Hao Li, Chenghao Yang, An Zhang, Yang Deng, Xiang Wang, and Tat-Seng Chua. Hello again! LLM-powered personalized agent for long-term dialogue. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pages 5259–5276, Albuquerque, New M...

  22. [22]

    StructRAG: Boosting knowledge intensive reasoning of LLMs via inference-time hybrid information structurization

    Zhuoqun Li, Xuanang Chen, Haiyang Yu, Hongyu Lin, Yaojie Lu, Qiaoyu Tang, Fei Huang, Xianpei Han, Le Sun, and Yongbin Li. StructRAG: Boosting knowledge intensive reasoning of LLMs via inference-time hybrid information structurization. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=GhexuBLxbO

  23. [23]

    and Lin, Kevin and Hewitt, John and Paranjape, Ashwin and Bevilacqua, Michele and Petroni, Fabio and Liang, Percy

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts.Transactions of the Association for Computational Linguistics, 12:157–173, 2024. URL https://doi.org/ 10.1162/tacl_a_00638

  24. [24]

    Evaluating very long-term conversational memory of LLM agents

    Adyasha Maharana, Dong-Ho Lee, Sergey Tulyakov, Mohit Bansal, Francesco Barbieri, and Yuwei Fang. Evaluating very long-term conversational memory of LLM agents. InProceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 13851–13870, Bangkok, Thailand, 2024. Association for Computational Linguis...

  25. [25]

    Towards lifelong dialogue agents via timeline-based memory management

    Kai Tzu-iunn Ong, Namyoung Kim, Minju Gwak, Hyungjoo Chae, Taeyoon Kwon, Yohan Jo, Seung-won Hwang, Dongha Lee, and Jinyoung Yeo. Towards lifelong dialogue agents via timeline-based memory management. InProceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (V...

  26. [26]

    URL https://aclanthology.org/2025

    Association for Computational Linguistics. URL https://aclanthology.org/2025. naacl-long.435/

  27. [27]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. Memgpt: Towards llms as operating systems, 2024. URL https: //arxiv.org/abs/2310.08560

  28. [28]

    Vicky Zhao, Lili Qiu, and Dongmei Zhang

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Menglin Xia, Xufang Luo, Jue Zhang, Qingwei Lin, Victor Rühle, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Dongmei Zhang. LLMLingua-2: Data distillation for efficient and faithful task-agnostic prompt compression. In Findings of the Association for Computational Linguistics: ACL 2024, pages 963–981, Ban...

  29. [29]

    Vicky Zhao, Lili Qiu, and Jianfeng Gao

    Zhuoshi Pan, Qianhui Wu, Huiqiang Jiang, Xufang Luo, Hao Cheng, Dongsheng Li, Yuqing Yang, Chin-Yew Lin, H. Vicky Zhao, Lili Qiu, and Jianfeng Gao. Secom: On memory construction and retrieval for personalized conversational agents. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview.net/forum? id=xKDZAW0He3

  30. [30]

    Direct preference optimization: Your language model is secretly a reward model

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. Direct preference optimization: Your language model is secretly a reward model. InThirty-seventh Conference on Neural Information Processing Systems, 2023. URL https://openreview.net/forum?id=HPuSIXJaa9

  31. [31]

    From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs

    Alireza Rezazadeh, Zichao Li, Wei Wei, and Yujia Bao. From isolated conversations to hierarchical schemas: Dynamic tree memory representation for LLMs. InThe Thirteenth 12 International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=moXtEmCleY

  32. [32]

    Beyond pipelines: A survey of the paradigm shift toward model-native agentic ai, 2025

    Jitao Sang, Jinlin Xiao, Jiarun Han, Jilin Chen, Xiaoyi Chen, Shuyu Wei, Yongjie Sun, and Yuhang Wang. Beyond pipelines: A survey of the paradigm shift toward model-native agentic ai, 2025. URLhttps://arxiv.org/abs/2510.16720

  33. [33]

    Proximal Policy Optimization Algorithms

    John Schulman, Filip Wolski, Prafulla Dhariwal, Alec Radford, and Oleg Klimov. Proximal policy optimization algorithms, 2017. URLhttps://arxiv.org/abs/1707.06347

  34. [34]

    Zhihong Shao, Peiyi Wang, Qihao Zhu, Runxin Xu, Junxiao Song, Xiao Bi, Haowei Zhang, Mingchuan Zhang, Y . K. Li, Y . Wu, and Daya Guo. Deepseekmath: Pushing the limits of mathematical reasoning in open language models, 2024. URL https://arxiv.org/abs/ 2402.03300

  35. [35]

    REMem: Reasoning with episodic memory in language agent

    Yiheng Shu, Saisri Padmaja Jonnalagedda, Xiang Gao, Bernal Jiménez Gutiérrez, Weijian Qi, Kamalika Das, Huan Sun, and Yu Su. REMem: Reasoning with episodic memory in language agent. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview.net/forum?id=fugnQxbvMm

  36. [36]

    Enhancing agentic rl with progressive reward shaping and value-based sampling policy optimization.arXiv preprint arXiv:2512.07478, 2025

    Jianghao Su, Xia Zeng, Luhui Liu, Chao Luo, Ye Chen, and Zhuoran Zhuang. Enhancing agentic rl with progressive reward shaping and value-based sampling policy optimization, 2026. URLhttps://arxiv.org/abs/2512.07478

  37. [37]

    H-MEM: Hierarchical memory for high-efficiency long-term reasoning in LLM agents

    Haoran Sun, Shaoning Zeng, and Bob Zhang. H-MEM: Hierarchical memory for high-efficiency long-term reasoning in LLM agents. InProceedings of the 19th Conference of the European Chapter of the Association for Computational Linguistics (Volume 1: Long Papers), pages 341–350, Rabat, Morocco, 2026. Association for Computational Linguistics. URL https: //doi.o...

  38. [38]

    In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents

    Zhen Tan, Jun Yan, I-Hung Hsu, Rujun Han, Zifeng Wang, Long Le, Yiwen Song, Yanfei Chen, Hamid Palangi, George Lee, Anand Rajan Iyer, Tianlong Chen, Huan Liu, Chen-Yu Lee, and Tomas Pfister. In prospect and retrospect: Reflective memory management for long-term personalized dialogue agents. InProceedings of the 63rd Annual Meeting of the Association for C...

  39. [39]

    URL https://aclanthology.org/2025

    Association for Computational Linguistics. URL https://aclanthology.org/2025. acl-long.413/

  40. [40]

    TRL: Transformers Re- inforcement Learning

    Leandro von Werra, Younes Belkada, Lewis Tunstall, Edward Beeching, Tristan Thrush, Nathan Lambert, Shengyi Huang, Kashif Rasul, and Quentin Gallouédec. TRL: Transformers Re- inforcement Learning. https://github.com/huggingface/trl, 2020. Software library, Apache-2.0 license

  41. [41]

    Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025

    Qingyue Wang, Yanhe Fu, Yanan Cao, Shuai Wang, Zhiliang Tian, and Liang Ding. Recursively summarizing enables long-term dialogue memory in large language models.Neurocomputing, 639:130193, 2025. URLhttps://doi.org/10.1016/j.neucom.2025.130193

  42. [42]

    Beyond the limits: A survey of techniques to extend the context length in large language models

    Xindi Wang, Mahsa Salmani, Parsa Omidi, Xiangyu Ren, Mehdi Rezagholizadeh, and Ar- maghan Eshaghi. Beyond the limits: A survey of techniques to extend the context length in large language models. InProceedings of the Thirty-Third International Joint Conference on Artificial Intelligence, IJCAI-24, pages 8299–8307. International Joint Conferences on Artifi...

  43. [43]

    Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs

    Xumeng Wen, Zihan Liu, Shun Zheng, Shengyu Ye, Zhirong Wu, Yang Wang, Zhijian Xu, Xiao Liang, Junjie Li, Ziming Miao, Jiang Bian, and Mao Yang. Reinforcement learning with verifiable rewards implicitly incentivizes correct reasoning in base LLMs. InThe Fourteenth International Conference on Learning Representations, 2026. URL https://openreview. net/forum...

  44. [44]

    Long- memeval: Benchmarking chat assistants on long-term interactive memory

    Di Wu, Hongwei Wang, Wenhao Yu, Yuwei Zhang, Kai-Wei Chang, and Dong Yu. Long- memeval: Benchmarking chat assistants on long-term interactive memory. InThe Thirteenth International Conference on Learning Representations, 2025. URL https://openreview. net/forum?id=pZiyCaVuti. 13

  45. [45]

    From Human Memory to AI Memory: A Survey on Memory Mechanisms in the Era of LLMs

    Yaxiong Wu, Sheng Liang, Chen Zhang, Yichao Wang, Yongyue Zhang, Huifeng Guo, Ruiming Tang, and Yong Liu. From human memory to ai memory: A survey on memory mechanisms in the era of LLMs, 2025. URLhttps://arxiv.org/abs/2504.15965

  46. [46]

    DaGRPO: Rectifying gradient conflict in reasoning via distinctiveness-aware group relative policy optimization

    Xuan Xie, Xuan Wang, and Wenjie Wang. DaGRPO: Rectifying gradient conflict in reasoning via distinctiveness-aware group relative policy optimization. InLogical and Symbolic Reasoning in Language Models @ AAAI 2026, 2026. URL https://openreview.net/forum?id= SucCwKlD9k

  47. [47]

    From single to multi- granularity: Toward long-term memory association and selection of conversational agents

    Derong Xu, Yi Wen, Pengyue Jia, Yingyi Zhang, Wenlin Zhang, Yichao Wang, Huifeng Guo, Ruiming Tang, Xiangyu Zhao, Enhong Chen, and Tong Xu. From single to multi- granularity: Toward long-term memory association and selection of conversational agents. In The Fourteenth International Conference on Learning Representations, 2026. URL https: //openreview.net/...

  48. [48]

    RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation

    Fangyuan Xu, Weijia Shi, and Eunsol Choi. RECOMP: Improving retrieval-augmented LMs with context compression and selective augmentation. InThe Twelfth International Con- ference on Learning Representations, 2024. URL https://openreview.net/forum?id= mlJLVigNHp

  49. [49]

    A-mem: Agentic memory for LLM agents

    Wujiang Xu, Zujie Liang, Kai Mei, Hang Gao, Juntao Tan, and Yongfeng Zhang. A-mem: Agentic memory for LLM agents. InThe Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URLhttps://openreview.net/forum?id=FiM0M8gcct

  50. [50]

    B. Y . Yan, Chaofan Li, Hongjin Qian, Shuqi Lu, and Zheng Liu. General agentic memory via deep research, 2025. URLhttps://arxiv.org/abs/2511.18423

  51. [51]

    Memory-R1: Enhancing Large Language Model Agents to Manage and Utilize Memories via Reinforcement Learning

    Sikuan Yan, Xiufeng Yang, Zuchao Huang, Ercong Nie, Zifeng Ding, Zonggen Li, Xiaowen Ma, Jinhe Bi, Kristian Kersting, Jeff Z. Pan, Hinrich Schütze, Volker Tresp, and Yunpu Ma. Memory-R1: Enhancing large language model agents to manage and utilize memories via reinforcement learning, 2026. URL https://arxiv.org/abs/2508.19828

  52. [52]

    DAPO: An open-source LLM reinforcement learning system at scale

    Qiying Yu, Zheng Zhang, Ruofei Zhu, Yufeng Yuan, Xiaochen Zuo, Yu Yue, Weinan Dai, Tiantian Fan, Gaohong Liu, Juncai Liu, LingJun Liu, Xin Liu, Haibin Lin, Zhiqi Lin, Bole Ma, Guangming Sheng, Yuxuan Tong, Chi Zhang, Mofan Zhang, Ru Zhang, Wang Zhang, Hang Zhu, Jinhua Zhu, Jiaze Chen, Jiangjie Chen, Chengyi Wang, Hongli Yu, Yuxuan Song, Xiangpeng Wei, Hao...

  53. [53]

    Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model?

    Yang Yue, Zhiqi Chen, Rui Lu, Andrew Zhao, Zhaokai Wang, Yang Yue, Shiji Song, and Gao Huang. Does reinforcement learning really incentivize reasoning capacity in LLMs beyond the base model? In The Thirty-ninth Annual Conference on Neural Information Processing Systems, 2026. URL https://openreview.net/forum?id=4OsgYD7em5

  54. [54]

    The landscape of agentic reinforcement learning for LLMs: A survey

    Guibin Zhang, Hejia Geng, Xiaohang Yu, Zhenfei Yin, Zaibin Zhang, Zelin Tan, Heng Zhou, Zhong-Zhi Li, Xiangyuan Xue, Yijiang Li, Yifan Zhou, Yang Chen, Chen Zhang, Yutao Fan, Zihu Wang, Songtao Huang, Francisco Piedrahita Velez, Yue Liao, Hongru Wang, Mengyue Yang, Heng Ji, Jun Wang, Shuicheng Yan, Philip Torr, and Lei Bai. The landscape of agentic reinfo...

  55. [55]

    URL https://openreview.net/forum?id=RY19y2RI1O

  56. [56]

    Assomem: Scalable memory QA with multi-signal associative retrieval

    Kai Zhang, Xinyuan Zhang, Ejaz Ahmed, Hongda Jiang, Caleb Kumar, Kai Sun, Zhaojiang Lin, Sanat Sharma, Shereen Oraby, AARON COLAK, Ahmed A Aly, Anuj Kumar, Xiaozhong Liu, and Xin Luna Dong. Assomem: Scalable memory QA with multi-signal associative retrieval. In The Fourteenth International Conference on Learning Representations, 2026. URL https://openrevie...

  57. [57]

    Bridging intuitive associations and deliberate recall: Empowering LLM personal assistant with graph-structured long-term memory

    Yujie Zhang, Weikang Yuan, and Zhuoren Jiang. Bridging intuitive associations and deliberate recall: Empowering LLM personal assistant with graph-structured long-term memory. In Findings of the Association for Computational Linguistics: ACL 2025, pages 17533–17547, Vienna, Austria, 2025. Association for Computational Linguistics. URL https://aclantholo...

  58. [58]

    A survey on the memory mechanism of large language model-based agents

    Zeyu Zhang, Quanyu Dai, Xiaohe Bo, Chen Ma, Rui Li, Xu Chen, Jieming Zhu, Zhenhua Dong, and Ji-Rong Wen. A survey on the memory mechanism of large language model-based agents. ACM Transactions on Information Systems, 43(6):155:1–155:47, 2025. URL https://doi.org/10.1145/3748302

  59. [59]

    MemoryBank: Enhancing large language models with long-term memory

    Wanjun Zhong, Lianghong Guo, Qiqi Gao, He Ye, and Yanlin Wang. MemoryBank: Enhancing large language models with long-term memory. In Proceedings of the Thirty-Eighth AAAI Conference on Artificial Intelligence, pages 19724–19731. AAAI Press, 2024. URL https://doi.org/10.1609/aaai.v38i17.29946.

    A Datasets and Baseline Methods

    A.1 Datasets. We evaluate lo...

  60. [60]

    Read the question and the conversation history. Consider each message one by one

  61. [61]

    Add a message’s "msg_id" to "useful_msg" ONLY if that message is actually useful for answering the question

  62. [62]

    For every "msg_id" in "useful_msg", add exactly one entry to "distilled_info":
    - "msg_id": the same id
    - "info": a single self-contained statement (or a compact set of statements). Includes all information from the target message (i.e., the message of the same msg_id) that is useful for answering the question

  63. [63]

    Make "info" self-contained:
    - Conduct reference resolution (pronouns, ellipsis, named entities) when the referent is unambiguous in its surrounding context.
    - Interpret "I/we/my" from the perspective of the message’s speaker and interpret "you/your" as the conversational counterpart, unless the context indicates reported speech

  64. [64]

    Each "info" entry must be grounded primarily in the message of the same msg_id, plus minimal preceding discourse context when necessary.
    - You may use nearby preceding messages in the same segment for two limited purposes:
      (a) Reference resolution: resolve pronouns/ellipsis when unambiguous.
      (b) Discourse-context restoration: recover the minimal preceding...

  65. [65]

    Do NOT include meta commentary (e.g., "this message is useful because...") in "info"

  66. [66]

    If the conversation history contains no information useful for answering the question, output: {{ "useful_msg": [], "distilled_info": [] }}

    Distiller user prompt

    [CONVERSATION HISTORY FORMAT]
    - The conversation history is a list of conversation segments.
    - Each segment is a list of messages.
    - Each message has the following fields:
      - `msg_id`: the id of t...
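The output rules in the distiller prompt above (a "useful_msg" list, and exactly one "distilled_info" entry with a non-empty "info" per selected "msg_id") can be checked mechanically. The following is a minimal sketch of such a format check, not the paper's validity reward. The function name `validate_distiller_output` and the `candidate_ids` argument are hypothetical, message ids are assumed to be strings or integers, and the doubled braces in the prompt are assumed to be template escaping for a single JSON object.

```python
import json
from collections import Counter


def validate_distiller_output(raw, candidate_ids):
    """Return a list of format violations; an empty list means well formed.

    Illustrative only: `candidate_ids` is an assumed set of msg_id values
    that were actually shown to the distiller.
    """
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return ["output is not valid JSON"]
    if not isinstance(data, dict):
        return ["top-level value is not a JSON object"]

    useful = data.get("useful_msg")
    distilled = data.get("distilled_info")
    if not isinstance(useful, list) or not isinstance(distilled, list):
        return ['"useful_msg" and "distilled_info" must both be lists']

    errors = []
    selected = []
    for mid in useful:
        if not isinstance(mid, (str, int)):
            errors.append(f"msg_id is not a string or integer: {mid!r}")
        elif mid not in candidate_ids:
            errors.append(f"msg_id not in the retrieved candidates: {mid!r}")
        elif mid not in selected:
            selected.append(mid)

    # Count how many distilled entries each msg_id received.
    counts = Counter()
    for entry in distilled:
        if not isinstance(entry, dict) or not isinstance(entry.get("msg_id"), (str, int)):
            errors.append(f"malformed distilled_info entry: {entry!r}")
            continue
        info = entry.get("info")
        if not isinstance(info, str) or not info.strip():
            errors.append(f"missing or empty info for msg_id {entry['msg_id']!r}")
        counts[entry["msg_id"]] += 1

    # Each selected message needs exactly one entry, and no entry may be orphaned.
    for mid in selected:
        if counts[mid] != 1:
            errors.append(f"expected exactly one entry for {mid!r}, found {counts[mid]}")
    for mid in counts:
        if mid not in selected:
            errors.append(f"distilled_info entry for unselected msg_id {mid!r}")
    return errors
```

A check of this kind only covers structure. Whether each "info" statement is faithful to its source message is a separate question that the structural rules cannot answer.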

  67. [67]

    Info_extracted: a list of {msg_id, info}

  68. [68]

    Original_Segs: a list of conversation segments containing the original messages.
    Data format of Original_Segs:
    - The conversation history is a list of conversation segments.
    - Each segment is a list of messages.
    - Each message has the following fields:
      - `msg_id`: the id of the message.
      - `speaker` (optional): who said this message (e.g., "user", "assista...

  69. [69]

    resolve references/pronouns in the TARGET MESSAGE or in the info

  70. [70]

    recover the minimal conversational context needed to interpret what the TARGET MESSAGE is responding to.
    - Speaker viewpoint rule:
      - Pronouns like "I/we/my" are from the perspective of the TARGET MESSAGE’s speaker.
      - "you/your" refers to the conversational counterpart unless context indicates otherwise.
      - For reported speech/quotations, resolve pronouns b...