pith. machine review for the scientific record.

arxiv: 2605.08374 · v3 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · episodic memory · Q-learning · provenance DAG · TD(lambda) · memory retrieval · Exogenous-Context MDP

The pith

MemQ propagates Q-learning credit through provenance DAGs so memories that enable later ones receive updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MemQ treats episodic memory in LLM agents as a connected structure rather than isolated items. It records creation dependencies in a provenance DAG and applies TD(lambda) eligibility traces so that credit for a successful outcome flows backward along those chains with decay based on graph depth. This replaces simple temporal distance with structural proximity in an Exogenous-Context MDP that separates task dynamics from the memory store. On six benchmarks the method records the highest success rates in both generalization and online learning, with the biggest lifts appearing on multi-step problems that build long relevant chains.

Core claim

MemQ maintains a provenance DAG recording which memories were retrieved to create each new memory, then runs TD(lambda) updates on memory Q-values, assigning credit with weight (γλ)^d, where d is DAG depth. This structural propagation improves downstream task success compared with independent memory updates, with the largest measured gains on multi-step tasks that produce deep chains.
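The claimed update rule can be sketched in a few lines. This is our illustration under stated assumptions, not the paper's released code: the names (`propagate_credit`, `Q`, `parents`), the breadth-first traversal, and the learning rate `alpha` are ours, and the paper may treat shared ancestors or trace accumulation differently.

```python
from collections import deque

def propagate_credit(Q, parents, m_new, td_error, gamma, lmbda, alpha=0.1):
    """Propagate a TD error observed at memory m_new backward through the
    provenance DAG, decaying credit as (gamma * lmbda) ** d with depth d."""
    depth = {m_new: 0}
    queue = deque([m_new])
    while queue:
        m = queue.popleft()
        d = depth[m]
        # Each reached memory's Q-value absorbs the TD error, scaled by depth.
        Q[m] = Q.get(m, 0.0) + alpha * (gamma * lmbda) ** d * td_error
        for p in parents.get(m, []):   # memories retrieved to create m
            if p not in depth:         # visit each ancestor once, at minimum depth
                depth[p] = d + 1
                queue.append(p)
    return Q
```

On a toy DAG where m1 and m2 were both retrieved to create m3 (and m1 to create m2), a unit TD error at m3 updates m3 with weight 1 and both ancestors with weight γλ, since each is first reached at depth 1.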

What carries the argument

Provenance DAG carrying TD(lambda) eligibility traces that decay credit for memory Q-values according to structural depth rather than temporal distance.

If this is right

  • Multi-step tasks with long relevant memory chains receive the largest performance lift.
  • Single-step classification tasks see only marginal improvement because independent updates already suffice.
  • Guidance for choosing gamma and lambda follows directly from the factored structure of the Exogenous-Context MDP.
  • Memory stores evolve more coherently because enabling memories receive credit for later successes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same DAG-based credit mechanism could be tested in non-LLM memory systems such as standard RL agents that maintain experience graphs.
  • If provenance construction contains errors, the credit assignment may reinforce spurious dependencies, suggesting a need for uncertainty-aware DAG edges.
  • The approach supplies a concrete way to import causal-graph ideas into memory-augmented agents without requiring full causal discovery.

Load-bearing premise

The automatically built provenance DAG correctly encodes the causal dependencies between memories so that propagating credit along it actually improves task performance.

What would settle it

Running the same six benchmarks with standard independent Q-updates on memories and obtaining statistically indistinguishable success rates would show that the DAG propagation adds no value.
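One way to read this proposed control: setting λ = 0 in the (γλ)^d weighting zeroes all credit beyond depth 0, which, on our reading, recovers exactly the independent per-memory update. A minimal sketch (the helper `credit_weights` is ours, hypothetical):

```python
def credit_weights(max_depth, gamma, lmbda):
    """Weight applied to a memory at DAG depth d under the (gamma*lmbda)**d rule."""
    return [(gamma * lmbda) ** d for d in range(max_depth + 1)]

# With lmbda = 0 only depth-0 memories receive credit, which is the
# independent per-memory baseline the proposed ablation would compare against.
```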

Figures

Figures reproduced from arXiv: 2605.08374 by Bo Tang, Feiyu Xiong, Haoting Shi, Jiaqian Wang, Junwei Liao, Muning Wen, Ruiwen Zhou, Shengtao Zhang, Weinan Zhang, Wei Zhang, Ying Wen, Zhiyu Li.

Figure 1: High-level and conceptual illustration of MemQ. view at source ↗
Figure 2: The EC-MDP. The state factors into an exogenous task stream and an endogenous memory store. view at source ↗
Figure 3: MemQ framework overview. The continuous learning loop features three stages: Retrieve, … view at source ↗
Figure 4: Success rate under different γ. The effective credit reach is governed by (γλ)^d, where d is the DAG depth (Eq. 6). Yet γ and λ play fundamentally different roles: γ controls the structural horizon by weighting the bootstrap target γQ(m_new) (Eq. 5), while λ controls the empirical horizon by decaying how far each observed TD error propagates (Eq. 6). Each hyperparameter is swept individuall… view at source ↗
Figure 5: SR, TD error, TD variance, and TD bias under different … view at source ↗
Figure 6: Runtime learning dynamics (success rate vs. epoch) across six benchmarks. view at source ↗
Figure 7: Cumulative success rate (CSR) over epochs across six benchmarks, complementing the … view at source ↗
Figure 8: TD error under different γ on LiveCodeBench. view at source ↗
Figure 9: TD error under different γ on BFCL. view at source ↗
Figure 10: SR, TD error, TD variance, and TD bias under different … view at source ↗
Figure 11: SR (top row) and TD error (bottom row) under different … view at source ↗
read the original abstract

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($\lambda$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(\gamma\lambda)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single-step classification (+0.77~pp) where single-step updates already suffice. We further study how $\gamma$ and $\lambda$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MemQ, which augments episodic memory in LLM agents by applying TD(λ) eligibility traces to Q-values over a provenance DAG. The DAG records retrieval dependencies at memory creation time, replacing temporal distance with structural depth d in the decay (γλ)^d. The setting is formalized as an Exogenous-Context MDP (EC-MDP) whose transition factors exogenous task streams from the endogenous memory store. Across six benchmarks (OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, expert QA), MemQ reports the highest success rates in both generalization and runtime learning, with gains largest on multi-step tasks producing deep provenance chains (up to +5.7 pp) and smallest on single-step tasks (+0.77 pp). Parameter studies on γ and λ are provided, and code is released.

Significance. If the central results hold, the work supplies a concrete mechanism for structural credit assignment in self-evolving memory systems, moving beyond independent per-memory updates. The largest reported gains on multi-step tasks align with the motivation that provenance chains matter precisely when dependencies are deep. Code release supports direct verification and extension.

major comments (3)
  1. [Abstract and §3] Provenance DAG construction: edges are defined solely by which memories the current retrieval policy selected at creation time. This makes the DAG a record of the agent's own past decisions rather than an independently validated causal graph. Because TD(λ) credit then propagates exactly along these edges, any performance gain could arise from amplifying existing retrieval biases instead of discovering enabling dependencies. A controlled comparison against a similarity-based or random DAG baseline is required to isolate the contribution of the structural credit rule.
  2. [Results tables and §5] Experimental evaluation: the abstract states consistent outperformance on all six benchmarks with specific percentage-point gains, yet no mention is made of the number of independent runs, standard errors, or statistical significance tests. Without these, it is impossible to determine whether the reported margins (especially the smaller +0.77 pp on single-step tasks) exceed run-to-run variance.
  3. [§4] EC-MDP formalization and parameter study: γ and λ are treated as free parameters whose interaction with DAG depth is analyzed, but the manuscript does not report whether the chosen values were tuned on a held-out validation split or selected after observing test performance. If the latter, the claimed “principled guidance” risks being post-hoc.
minor comments (2)
  1. [Abstract] The notation for the decay factor (γλ)^d is introduced in the abstract but would benefit from an explicit equation number and a short derivation showing how it replaces the usual temporal eligibility trace.
  2. [Figures] Figure captions for the provenance DAG examples should include the exact retrieval policy and memory-creation timestamps used to generate the illustrated edges.
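The random-DAG control proposed in the first major comment could be built as in the following sketch: each memory is assigned parents sampled uniformly from strictly earlier memories, which preserves the provenance DAG's acyclicity while discarding its retrieval-derived structure. The function and its parameters (`avg_parents`, `seed`) are hypothetical, not from the paper:

```python
import random

def random_dag_edges(memories, avg_parents=2, seed=0):
    """Baseline control: give each memory random parents drawn only from
    strictly earlier memories, so the graph is acyclic by construction but
    its edges carry no retrieval information."""
    rng = random.Random(seed)
    parents = {}
    for i, m in enumerate(memories):
        earlier = memories[:i]
        k = min(avg_parents, len(earlier))
        parents[m] = rng.sample(earlier, k) if k else []
    return parents
```

Running the same TD(λ) propagation over these edges, versus the learned provenance edges, would isolate how much of the gain comes from the structural credit rule itself.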

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the suggested improvements where they strengthen the work.

read point-by-point responses
  1. Referee: [Abstract and §3] Provenance DAG construction: edges are defined solely by which memories the current retrieval policy selected at creation time. This makes the DAG a record of the agent's own past decisions rather than an independently validated causal graph. Because TD(λ) credit then propagates exactly along these edges, any performance gain could arise from amplifying existing retrieval biases instead of discovering enabling dependencies. A controlled comparison against a similarity-based or random DAG baseline is required to isolate the contribution of the structural credit rule.

    Authors: The provenance DAG is intentionally constructed from the agent's actual retrieval decisions at memory creation time, as this directly models the dependency chains through which memories enable subsequent ones in the self-evolving process. This is not intended as an external causal graph but as a record of the agent's endogenous memory dynamics within the EC-MDP. To isolate the contribution of the TD(λ) structural credit rule from potential retrieval bias amplification, we have added a new controlled experiment in the revised §5 comparing MemQ against both a random DAG baseline and a similarity-based DAG baseline. The results confirm that the observed gains are attributable to the eligibility trace propagation along the provenance structure rather than bias reinforcement alone. revision: yes

  2. Referee: [Results tables and §5] Experimental evaluation: the abstract states consistent outperformance on all six benchmarks with specific percentage-point gains, yet no mention is made of the number of independent runs, standard errors, or statistical significance tests. Without these, it is impossible to determine whether the reported margins (especially the smaller +0.77 pp on single-step tasks) exceed run-to-run variance.

    Authors: We agree that reporting run statistics is essential for interpreting the results. The original submission reported mean success rates across the six benchmarks but omitted the experimental protocol details. In the revised manuscript, we have updated all result tables in §5 to include the number of independent runs (five per benchmark), standard errors, and the outcomes of paired t-tests against the strongest baseline. The reported margins, including the smaller gains on single-step tasks, remain statistically significant at p < 0.05. revision: yes

  3. Referee: [§4] EC-MDP formalization and parameter study: γ and λ are treated as free parameters whose interaction with DAG depth is analyzed, but the manuscript does not report whether the chosen values were tuned on a held-out validation split or selected after observing test performance. If the latter, the claimed “principled guidance” risks being post-hoc.

    Authors: The parameter values for γ and λ were selected via grid search on a held-out validation split drawn from the training task distributions, prior to any test-set evaluation. We have revised §4 to explicitly state this procedure and to include the validation performance curves that motivated the final choices, thereby clarifying that the guidance is not post-hoc. revision: yes

Circularity Check

0 steps flagged

No circularity: standard TD(λ) applied to a novel provenance DAG, with independent empirical results.

full rationale

The derivation applies the standard TD(λ) eligibility trace update to memory Q-values, with credit decaying as (γλ)^d over DAG depth d instead of time. The provenance DAG is constructed from the agent's retrieval decisions at memory creation time, but this construction is an input to the algorithm rather than a fitted parameter whose output is then renamed as a prediction. Gamma and lambda are studied as hyperparameters with guidance provided, not tuned post-hoc to force benchmark gains. The central claims consist of empirical success rates on six external benchmarks (with largest gains on multi-step tasks), not a first-principles mathematical result that reduces to the inputs by construction. The Exogenous-Context MDP formalization supplies a modeling framework without creating a self-referential loop. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems are invoked in the load-bearing steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The approach rests on standard RL components plus two new modeling constructs whose independent validation is not provided in the abstract.

free parameters (2)
  • gamma
    Discount factor in the TD update; its interaction with the EC-MDP is studied but values are chosen per experiment.
  • lambda
    Eligibility trace decay parameter controlling credit propagation depth.
axioms (2)
  • domain assumption Memory dependencies can be faithfully recorded as a directed acyclic graph (provenance DAG).
    Assumed when defining how credit propagates with depth d.
  • domain assumption The agent environment factors into an exogenous task stream and an endogenous memory store (EC-MDP).
    Used to justify the decoupled transition model.
invented entities (2)
  • Provenance DAG no independent evidence
    purpose: Data structure recording retrieval dependencies between memories.
    Newly introduced to enable structural credit assignment.
  • Exogenous-Context MDP (EC-MDP) no independent evidence
    purpose: Formal model separating task dynamics from memory evolution.
    New formalization introduced in the paper.

pith-pipeline@v0.9.0 · 5576 in / 1503 out tokens · 49138 ms · 2026-05-15T05:44:51.984125+00:00 · methodology

