
arxiv: 2602.06470 · v2 · pith:YXJGC6N4new · submitted 2026-02-06 · 💻 cs.CL · cs.AI

Improve Large Language Model Systems with User Logs

Pith reviewed 2026-05-21 14:30 UTC · model grok-4.3

classification 💻 cs.CL cs.AI
keywords user logs · large language models · continual learning · feedback distillation · cognitive gap · retrieval augmented generation · preference optimization

The pith

UNO turns noisy user logs into rules and preferences that let LLM systems adaptively improve responses.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents UNO as a unified way to make large language model systems better by learning directly from the logs of real user interactions. It first extracts semi-structured rules and preference pairs from the raw logs, then clusters the data around queries and feedback signals, and finally measures the gap between what the model already knows and what the logs show. This gap guides the system to drop noisy parts of the feedback and build separate handling modules for primary experiences and reflective ones. A reader would care because it shifts improvement away from ever-larger training runs toward continual, low-cost adaptation using data that is already being collected in deployment.

Core claim

UNO distills unstructured user logs into semi-structured rules and preference pairs, applies query-and-feedback-driven clustering to handle data heterogeneity, quantifies the cognitive gap between the model's prior knowledge and the log content, and uses that assessment to filter noisy feedback while constructing distinct modules for primary and reflective experiences extracted from the logs, thereby improving future LLM system responses.

What carries the argument

The UNO framework, which distills logs into rules and preferences, clusters them by query and feedback, and quantifies the cognitive gap to adaptively filter noise and build experience modules.
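
To make the clustering step concrete, here is a minimal sketch of grouping distilled log items by a blend of query similarity and feedback similarity. It is not the paper's algorithm: the `LogItem` fields, the equal weighting, the 0.8 threshold, and the greedy assignment are all assumptions chosen for illustration, and the toy vectors stand in for real embeddings.

```python
from dataclasses import dataclass
from math import sqrt


@dataclass
class LogItem:
    """One distilled log entry. Fields are hypothetical, not the paper's schema."""
    text: str
    query_vec: list[float]     # embedding of the user query (assumed given)
    feedback_vec: list[float]  # embedding of the user feedback (assumed given)


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def joint_similarity(a: LogItem, b: LogItem, query_weight: float = 0.5) -> float:
    """Blend query similarity and feedback similarity into one score."""
    return (query_weight * cosine(a.query_vec, b.query_vec)
            + (1.0 - query_weight) * cosine(a.feedback_vec, b.feedback_vec))


def greedy_cluster(items: list[LogItem], threshold: float = 0.8) -> list[list[LogItem]]:
    """Assign each item to the first cluster whose seed item is similar enough."""
    clusters: list[list[LogItem]] = []
    for item in items:
        for cluster in clusters:
            if joint_similarity(item, cluster[0]) >= threshold:
                cluster.append(item)
                break
        else:
            clusters.append([item])  # no cluster matched: start a new one
    return clusters


items = [
    LogItem("sort a list", [1.0, 0.0], [1.0, 0.0]),
    LogItem("sort an array", [0.9, 0.1], [1.0, 0.0]),
    LogItem("translate text", [0.0, 1.0], [0.0, 1.0]),
]
clusters = greedy_cluster(items)
print([[item.text for item in cluster] for cluster in clusters])
```

Using both signals means two logs with similar queries but opposite feedback do not automatically land in the same group, which is one plausible reading of why the clustering is described as query-and-feedback-driven.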

If this is right

  • LLM systems using UNO achieve higher effectiveness and efficiency than both retrieval-augmented generation and memory-based methods on the tested tasks.
  • Cognitive-gap measurement allows the system to discard portions of user feedback judged too far from the model's existing knowledge.
  • Primary and reflective experience modules can be constructed separately from the same log stream to handle different types of user signals.
  • The off-policy optimization problem between log collection and model updates is addressed through the distillation and clustering pipeline.
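
The gap-based filtering in the second bullet can be sketched as follows. This is an illustration under stated assumptions, not the paper's metric: the gap is approximated here as one minus cosine similarity between a vector for the model's prior answer and a vector for the log-derived content, and the `max_gap` threshold of 0.6 is arbitrary.

```python
from math import sqrt


def cosine(a, b):
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def cognitive_gap(prior_vec, log_vec):
    """Stand-in gap score: 0 when the vectors align, up to 2 when they oppose."""
    return 1.0 - cosine(prior_vec, log_vec)


def filter_by_gap(entries, max_gap=0.6):
    """Split (name, prior_vec, log_vec) entries into kept and discarded names."""
    kept, discarded = [], []
    for name, prior_vec, log_vec in entries:
        if cognitive_gap(prior_vec, log_vec) <= max_gap:
            kept.append(name)
        else:
            discarded.append(name)  # judged too far from the model's prior
    return kept, discarded


# Toy vectors (hypothetical); real inputs would be embeddings.
entries = [
    ("close to prior", [1.0, 0.0], [0.8, 0.6]),  # gap of about 0.2
    ("far from prior", [1.0, 0.0], [0.0, 1.0]),  # gap of 1.0
]
kept, discarded = filter_by_gap(entries)
print(kept, discarded)
```

A single hard threshold is the simplest possible policy; an adaptive scheme, as the paper's wording suggests, would presumably vary the cutoff per cluster or per task.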

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Production LLM services could shift from periodic retraining to continuous, log-driven updates that require far less new human annotation.
  • The same distillation-plus-gap approach might transfer to non-LLM agents that maintain long interaction histories with users.
  • If the clustering step proves robust, similar log-processing pipelines could be applied to other noisy human-generated data streams such as customer support transcripts.

Load-bearing premise

User logs contain extractable, authentic human feedback signals that can be reliably distilled into rules and preference pairs without introducing new biases or noise that the clustering and cognitive-gap steps cannot handle.

What would settle it

Run the same set of user logs through UNO and through a standard RAG baseline while deliberately adding increasing levels of random or contradictory feedback entries, then measure whether UNO's accuracy and efficiency gains disappear once noise exceeds a measurable threshold.
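
The stress test described above could be set up along these lines. This is a sketch only: it assumes feedback is stored as (chosen, rejected) preference pairs, models contradictory feedback by swapping a fraction of those pairs, and takes a hypothetical `gain_at` callable that would run both systems on the corrupted logs and return the score difference. The linear gain function in the usage lines is made up to exercise the code and is not a result.

```python
import random


def inject_noise(pairs, rate, seed=0):
    """Swap chosen/rejected in a `rate` fraction of (chosen, rejected) pairs."""
    if not 0.0 <= rate <= 1.0:
        raise ValueError("rate must be in [0, 1]")
    rng = random.Random(seed)  # seeded so each noise level is reproducible
    n_flip = round(rate * len(pairs))
    flip = set(rng.sample(range(len(pairs)), n_flip))
    return [(r, c) if i in flip else (c, r) for i, (c, r) in enumerate(pairs)]


def noise_breakpoint(rates, gain_at):
    """Return the lowest noise rate at which the gain is gone (<= 0), else None."""
    for rate in sorted(rates):
        if gain_at(rate) <= 0:
            return rate
    return None


pairs = [("good answer", "bad answer")] * 10
noisy = inject_noise(pairs, 0.3)
print(sum(pair == ("bad answer", "good answer") for pair in noisy))

# Placeholder gain curve (hypothetical): gain shrinks linearly with noise.
print(noise_breakpoint([0.0, 0.2, 0.4], lambda rate: 0.1 - 0.3 * rate))
```

In a real run, `gain_at` would wrap the two evaluation pipelines, and the breakpoint would be reported with confidence intervals over several seeds.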

Figures

Figures reproduced from arXiv: 2602.06470 by Changyue Wang, Qingyao Ai, Weihang Su, Yiqun Liu.

Figure 1: The workflow of UNO. UNO first distills and filters raw user logs, then performs clustering and a cognitive gap...
Figure 2: Comparison of extra input tokens versus performance (Norm-Score). Extra tokens are computed using the Qwen3-8B...
Figure 3: Results of online evolution settings on phi-4 model.
Original abstract

Scaling training data and model parameters has long driven progress in large language models (LLMs), but this paradigm is increasingly constrained by the scarcity of high-quality data and diminishing returns from rising computational costs. As a result, recent work is increasing the focus on continual learning from real-world deployment, where user interaction logs provide a rich source of authentic human feedback and procedural knowledge. However, learning from user logs is challenging due to their unstructured and noisy nature. Vanilla LLM systems often struggle to distinguish useful feedback signals from noisy user behavior, and the disparity between user log collection and model optimization (e.g., the off-policy optimization problem) further strengthens the problem. To this end, we propose UNO (User log-driveN Optimization), a unified framework for improving LLM systems (LLMsys) with user logs. UNO first distills logs into semi-structured rules and preference pairs, then employs query-and-feedback-driven clustering to manage data heterogeneity, and finally quantifies the cognitive gap between the model's prior knowledge and the log data. This assessment guides the LLMsys to adaptively filter out noisy feedback and construct different modules for primary and reflective experiences extracted from user logs, thereby improving future responses. Extensive experiments show that UNO achieves state-of-the-art effectiveness and efficiency, significantly outperforming Retrieval Augmented Generation (RAG) and memory-based baselines. We have open-sourced our code at https://github.com/bebr2/UNO .

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes UNO (User log-driveN Optimization), a unified framework for improving LLM systems via user interaction logs. UNO distills unstructured logs into semi-structured rules and preference pairs, applies query-and-feedback-driven clustering to address data heterogeneity, and quantifies the cognitive gap between the model's prior knowledge and the log data. This gap assessment is used to adaptively filter noisy feedback and construct separate modules for primary and reflective experiences, with the goal of improving future responses. The authors report that extensive experiments demonstrate SOTA effectiveness and efficiency, significantly outperforming RAG and memory-based baselines, and have open-sourced the code.

Significance. If the central claims hold after addressing validation concerns, the work could meaningfully advance continual learning from real-world deployment logs by turning noisy user interactions into usable signals for LLM optimization. The open-sourced code supports reproducibility, which strengthens the contribution in an empirical field.

major comments (2)
  1. §3.3 (Cognitive Gap Quantification): The pipeline relies on the cognitive-gap metric to guide adaptive filtering and module construction, yet the manuscript provides no independent validation (e.g., human annotation, held-out oracle, or correlation with downstream usefulness) that this metric extracts genuine signal rather than model self-consistency. If the gap is derived from the target LLM's own embeddings or outputs, the filtering step risks circularity, which would undermine the reported gains over RAG and memory baselines.
  2. §4 (Experiments): The SOTA claim is load-bearing for the paper's contribution, but the experimental section does not include ablation studies isolating the contribution of the cognitive-gap step versus the distillation and clustering stages alone. Without these controls, it is unclear whether the outperformance is attributable to the proposed framework or to other implementation details.
minor comments (2)
  1. Abstract: The claim of 'significantly outperforming' RAG and memory baselines would be strengthened by reporting concrete metrics (e.g., accuracy deltas, latency reductions) rather than qualitative language.
  2. §2 (Related Work): The discussion of off-policy optimization challenges could more explicitly contrast UNO with prior log-based continual learning methods to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed review. The comments highlight important aspects of validation and experimental rigor that we address below. We have revised the manuscript to incorporate additional analyses and clarifications.

Point-by-point responses
  1. Referee: §3.3 (Cognitive Gap Quantification): The pipeline relies on the cognitive-gap metric to guide adaptive filtering and module construction, yet the manuscript provides no independent validation (e.g., human annotation, held-out oracle, or correlation with downstream usefulness) that this metric extracts genuine signal rather than model self-consistency. If the gap is derived from the target LLM's own embeddings or outputs, the filtering step risks circularity, which would undermine the reported gains over RAG and memory baselines.

    Authors: We acknowledge the validity of this concern about potential circularity and the absence of explicit independent validation in the original submission. The cognitive gap is computed as the divergence between the base model's prior knowledge representations and the knowledge encoded in the distilled log data, using embedding similarity from the target LLM. To strengthen this, the revised manuscript adds a new subsection in §3.3 reporting Pearson correlations between cognitive-gap scores and downstream task improvements on held-out queries. We also include a small-scale human annotation study (n=200 samples) where annotators rate the usefulness of filtered vs. unfiltered logs, showing statistically significant alignment with the metric. To further reduce any self-consistency risk, we now compute the gap using a separate, frozen embedding model distinct from the target LLM. These additions demonstrate that the metric captures genuine signal beyond model-internal consistency.
    revision: yes

  2. Referee: §4 (Experiments): The SOTA claim is load-bearing for the paper's contribution, but the experimental section does not include ablation studies isolating the contribution of the cognitive-gap step versus the distillation and clustering stages alone. Without these controls, it is unclear whether the outperformance is attributable to the proposed framework or to other implementation details.

    Authors: We agree that isolating the contribution of the cognitive-gap quantification is necessary to substantiate the framework's gains. The revised §4 now includes dedicated ablation experiments: (1) UNO without cognitive-gap filtering (replaced by fixed-threshold or random filtering), (2) distillation + clustering only, and (3) full UNO. Results on the primary benchmarks show that removing the cognitive-gap step reduces performance by 4-7% relative to full UNO while still outperforming RAG and memory baselines, confirming its additive value. We also report efficiency metrics for each ablation to address the efficiency claims. These controls clarify that the reported SOTA results stem from the integrated framework rather than isolated implementation choices.
    revision: yes
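
Two quantities named in the simulated responses above, a Pearson correlation between gap scores and downstream improvement and a performance drop relative to the full system, can be computed as in the sketch below. The function names and every number in the usage lines are hypothetical and chosen only to exercise the formulas; none of them come from the paper.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length >= 2")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    if var_x == 0 or var_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return cov / sqrt(var_x * var_y)


def relative_drop(full_score, ablated_score):
    """Fraction of the full system's score lost when a component is removed."""
    if full_score == 0:
        raise ValueError("full_score must be non-zero")
    return (full_score - ablated_score) / full_score


# Made-up gap scores and per-query improvements, for illustration only.
gap_scores = [0.1, 0.3, 0.5, 0.7]
improvements = [0.02, 0.05, 0.04, 0.09]
print(round(pearson(gap_scores, improvements), 3))

# Made-up scores: full system 80.0, ablated variant 76.0.
print(relative_drop(80.0, 76.0))
```

A relative drop is computed against the full system's score, so a 4-7% figure of that kind is not the same as a 4-7 point absolute difference; which one is meant matters when comparing against the baselines.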

Circularity Check

0 steps flagged

Empirical pipeline with no load-bearing circular derivation or self-referential reduction

The paper describes UNO as a three-stage empirical processing pipeline (distillation of logs into rules/preference pairs, query-feedback clustering, and cognitive-gap quantification to guide adaptive filtering and module construction). No equations, fitted parameters, or derivations are presented that reduce the claimed SOTA gains to a self-referential definition or construction. The central claims rest on experimental comparisons to RAG and memory baselines rather than a mathematical chain that collapses to its inputs. Any potential self-bias in the gap metric would require explicit confirmation from the full text that the quantification uses only the target LLM's own outputs in a closed loop; absent such a quoted reduction, the framework remains self-contained as an applied method.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the framework implicitly assumes that user logs are a rich source of procedural knowledge and that the proposed distillation and filtering steps preserve signal without new artifacts.

pith-pipeline@v0.9.0 · 5788 in / 1124 out tokens · 36977 ms · 2026-05-21T14:30:07.693426+00:00 · methodology

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Skill Retrieval Augmentation for Agentic AI

    cs.CL · 2026-04 · unverdicted · novelty 7.0

    Agents improve when they retrieve skills on demand from large corpora, yet current models cannot selectively decide when to load or ignore a retrieved skill.
