pith. machine review for the scientific record.

arxiv: 2604.01707 · v2 · submitted 2026-04-02 · 💻 cs.CL · cs.DB

Recognition: 2 theorem links

Memory in the LLM Era: Modular Architectures and Strategies in a Unified Framework


Pith reviewed 2026-05-13 22:00 UTC · model grok-4.3

classification 💻 cs.CL cs.DB
keywords LLM agents · memory mechanisms · unified framework · long-horizon tasks · modular architectures · benchmark comparison · hybrid memory · agent performance

The pith

A unified framework for LLM agent memory methods shows that recombining their modules creates a hybrid system outperforming prior state-of-the-art on standard benchmarks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language model agents require memory to sustain performance across long-horizon tasks such as extended dialogues or game sequences. The paper organizes existing memory techniques into a single high-level framework that highlights their shared modular structure. It then runs a controlled comparison of representative methods on two established benchmarks, revealing clear patterns in how different components contribute to success or failure. From those patterns the authors construct a new memory approach by selecting and combining the strongest modules, and this hybrid outperforms existing leaders. The results supply a practical basis for designing more capable agent memory in future work.

Core claim

The paper presents a unified framework that captures all existing agent memory methods at a modular level. Systematic side-by-side testing on two benchmarks identifies effective and ineffective components across methods. Exploiting this analysis, the authors assemble a new memory method from the strongest modules of prior work and demonstrate that it exceeds the performance of current state-of-the-art approaches on the same benchmarks.

What carries the argument

A unified modular framework that decomposes agent memory into interchangeable components for storage, retrieval, and updating.
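As a concrete illustration of that decomposition, the framework might look like the following minimal Python sketch. All class and method names here are hypothetical, not taken from the paper, and the retrieval and update policies are deliberately naive stand-ins (word overlap, oldest-first eviction) for whatever each surveyed method actually uses.

```python
# Hypothetical sketch of agent memory split into interchangeable modules.
# Names and policies are illustrative, not the paper's actual interfaces.

class StorageModule:
    """Holds raw memory entries; real systems may use vector stores or graphs."""
    def __init__(self):
        self.entries: list[str] = []

    def add(self, text: str) -> None:
        self.entries.append(text)

class RetrievalModule:
    """Ranks stored entries against a query (here: naive word overlap)."""
    def retrieve(self, storage: StorageModule, query: str, k: int = 3) -> list[str]:
        q = set(query.lower().split())
        ranked = sorted(storage.entries,
                        key=lambda e: len(q & set(e.lower().split())),
                        reverse=True)
        return ranked[:k]

class UpdateModule:
    """Maintains memory over time (here: drop oldest entries past a cap)."""
    def __init__(self, max_entries: int = 100):
        self.max_entries = max_entries

    def update(self, storage: StorageModule) -> None:
        overflow = len(storage.entries) - self.max_entries
        if overflow > 0:
            storage.entries = storage.entries[overflow:]

class AgentMemory:
    """Composes the three modules; swapping any one yields a new method."""
    def __init__(self, storage: StorageModule, retrieval: RetrievalModule,
                 update: UpdateModule):
        self.storage, self.retrieval, self.update = storage, retrieval, update

    def write(self, text: str) -> None:
        self.storage.add(text)
        self.update.update(self.storage)

    def read(self, query: str) -> list[str]:
        return self.retrieval.retrieve(self.storage, query)
```

Under this framing, the paper's controlled comparison amounts to holding the composition fixed while varying which concrete module fills each slot.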

If this is right

  • Memory methods share reusable modules whose individual contributions can be measured separately.
  • Selecting and combining strong modules from different methods produces measurable gains over any single original method.
  • Current methods vary widely in how they handle knowledge accumulation versus iterative reasoning.
  • Future designs can target specific task demands by swapping or weighting individual memory modules.
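The recombination step the points above describe can be sketched as a simple selection over per-module benchmark scores. The method names, slot names, and scores below are invented for illustration; the paper reports its actual module-level results in its experimental tables.

```python
# Invented example scores: (method, module slot) -> benchmark score.
# A hybrid is built by taking, for each slot, the method that scored best.
scores = {
    ("method_a", "storage"):   0.61,
    ("method_a", "retrieval"): 0.72,
    ("method_a", "update"):    0.55,
    ("method_b", "storage"):   0.68,
    ("method_b", "retrieval"): 0.64,
    ("method_b", "update"):    0.70,
}

def best_hybrid(scores: dict) -> dict:
    """For each module slot, keep the method whose module scored highest."""
    best: dict = {}
    for (method, slot), score in scores.items():
        if slot not in best or score > best[slot][1]:
            best[slot] = (method, score)
    return {slot: method for slot, (method, _) in best.items()}

print(best_hybrid(scores))
# {'storage': 'method_b', 'retrieval': 'method_a', 'update': 'method_b'}
```

Whether such a greedy per-slot selection composes well in practice is exactly what the paper's hybrid experiments test; modules may interact, so the best hybrid need not be the slot-wise argmax.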

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The modular view may extend to memory designs outside the LLM-agent setting, such as in classical planning systems.
  • Dynamic selection among modules at runtime could further improve results on mixed task types.
  • Resource cost differences among modules remain unmeasured and could limit deployment on smaller hardware.
  • The framework invites tests on longer or more open-ended tasks than the two benchmarks provide.

Load-bearing premise

The two chosen benchmarks capture the essential range of long-horizon tasks where memory determines agent success.

What would settle it

Evaluating the new hybrid memory method on an additional benchmark that involves multi-turn scientific discovery and observing that it no longer exceeds the previous best method would falsify the superiority claim.

Figures

Figures reproduced from arXiv: 2604.01707 by Fangyuan Zhang, Qintian Guo, Sibo Wang, Tenghui Lin, Xilin Liu, Xun Zhou, Yanchen Wu, Yingli Zhou, Yixiang Fang, Yuchi Ma.

Figure 1: Overview of naive long-context prompting and …
Figure 2: An overview of the unified framework for agent …
Figure 3: A sample prompt for summarization-based extrac…
Figure 4: Methods of information extraction.
Figure 5: Workflow of the memory management process.
Figure 6: Overall trade-off between performance and token …
Figure 7: Average token costs per dialogue across sessions on …
Figure 8: Robustness analysis of memory mechanisms on LONGMEMEVAL. (a) illustrates the context scalability as the input …
Figure 9: Context scalability of various memory methods …
Figure 10: Comparison of our newly designed method in …
Figure 11: The framework of our newly designed method.
Figure 13: Prompt for answer simplification.
Original abstract

Memory emerges as the core module in the large language model (LLM)-based agents for long-horizon complex tasks (e.g., multi-turn dialogue, game playing, scientific discovery), where memory can enable knowledge accumulation, iterative reasoning and self-evolution. A number of memory methods have been proposed in the literature. However, these methods have not been systematically and comprehensively compared under the same experimental settings. In this paper, we first summarize a unified framework that incorporates all the existing agent memory methods from a high-level perspective. We then extensively compare representative agent memory methods on two well-known benchmarks and examine the effectiveness of all methods, providing a thorough analysis of those methods. As a byproduct of our experimental analysis, we also design a new memory method by exploiting modules in the existing methods, which outperforms the state-of-the-art methods. Finally, based on these findings, we offer promising future research opportunities. We believe that a deeper understanding of the behavior of existing methods can provide valuable new insights for future research.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The paper presents a unified high-level framework that subsumes existing memory methods for LLM-based agents on long-horizon tasks. It performs a systematic empirical comparison of representative methods on two benchmarks, analyzes their behavior with respect to knowledge accumulation, iterative reasoning, and self-evolution, and, as a byproduct, constructs a new composite memory method by recombining modules from prior work; this new method is reported to outperform existing SOTA approaches. The manuscript concludes with a discussion of open research opportunities.

Significance. If the reported outperformance is robust, the work is significant because it supplies the first controlled head-to-head evaluation of memory modules under identical settings and demonstrates that modular recombination can yield measurable gains. The unified framework itself offers a useful organizing lens for future agent designs, and the explicit identification of promising research directions (e.g., better handling of self-evolution) adds value beyond the empirical results.

major comments (1)
  1. [§4 (Experimental Results), Table 2] The central claim that the newly designed composite memory method outperforms SOTA rests on head-to-head results from exactly two benchmarks. The manuscript provides no coverage argument showing that these benchmarks exercise the full spectrum of memory operations (knowledge accumulation across multi-turn dialogue, long-horizon game playing, and scientific discovery) enumerated in the introduction; without such justification the observed gains may be benchmark-specific rather than evidence of a generally superior modular architecture.
minor comments (2)
  1. [Abstract] The statement that experiments were conducted and a new method outperforms SOTA is given without any quantitative deltas, baseline names, or benchmark identifiers; adding these would improve readability.
  2. [§3 (Unified Framework)] The high-level modular decomposition is described qualitatively; a concise table or diagram that maps each prior method to the specific modules it uses would make the framework easier to use as a reference.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the constructive feedback and the recommendation for major revision. We agree that strengthening the justification for our benchmark choices will improve the manuscript and address concerns about the generalizability of our results. We provide a point-by-point response below.

Point-by-point responses
  1. Referee: [§4 (Experimental Results), Table 2] The central claim that the newly designed composite memory method outperforms SOTA rests on head-to-head results from exactly two benchmarks. The manuscript provides no coverage argument showing that these benchmarks exercise the full spectrum of memory operations (knowledge accumulation across multi-turn dialogue, long-horizon game playing, and scientific discovery) enumerated in the introduction; without such justification the observed gains may be benchmark-specific rather than evidence of a generally superior modular architecture.

    Authors: We appreciate this observation and agree that an explicit coverage argument was missing. The two benchmarks were selected as representative long-horizon tasks that require knowledge accumulation and iterative reasoning (one focused on multi-turn dialogue-style interactions and the other on game-playing environments). However, we did not provide a detailed mapping to all operations listed in the introduction, including scientific discovery. In the revised manuscript, we will add a new paragraph in §4 that (1) justifies the benchmark selection based on their coverage of core memory operations, (2) includes a table mapping benchmark tasks to knowledge accumulation, iterative reasoning, and self-evolution, and (3) explicitly acknowledges that scientific discovery scenarios are not directly evaluated, framing this as a limitation and future direction. This revision will clarify the scope of our claims without requiring new experiments. revision: yes

Circularity Check

0 steps flagged

No significant circularity in empirical framework summary and benchmark comparison

full rationale

The paper summarizes prior memory methods into a high-level unified framework, runs direct empirical comparisons of representative methods on two fixed benchmarks, and constructs a new composite method by recombining observed modules from those comparisons. No mathematical derivation chain exists. No equations, fitted parameters presented as predictions, self-definitional constructs, or load-bearing self-citations appear. Central claims rest on the reported head-to-head results rather than reducing to inputs by construction. This matches the expected non-circular outcome for an empirical survey paper.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

No mathematical axioms, free parameters, or invented entities are described in the abstract; the work is an empirical unification and comparison study.

pith-pipeline@v0.9.0 · 5504 in / 904 out tokens · 34484 ms · 2026-05-13T22:00:38.299773+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.


Reference graph

Works this paper leans on

134 extracted references · 134 canonical work pages · 19 internal anchors
