pith. machine review for the scientific record.

arxiv: 2605.08374 · v3 · submitted 2026-05-08 · 💻 cs.AI

Recognition: no theorem link

MemQ: Integrating Q-Learning into Self-Evolving Memory Agents over Provenance DAGs

Authors on Pith: no claims yet

Pith reviewed 2026-05-15 05:44 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM agents · episodic memory · Q-learning · provenance DAG · TD(lambda) · memory retrieval · Exogenous-Context MDP

The pith

MemQ propagates Q-learning credit through provenance DAGs so memories that enable later ones receive updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

MemQ treats episodic memory in LLM agents as a connected structure rather than isolated items. It records creation dependencies in a provenance DAG and applies TD(lambda) eligibility traces so that credit for a successful outcome flows backward along those chains with decay based on graph depth. This replaces simple temporal distance with structural proximity in an Exogenous-Context MDP that separates task dynamics from the memory store. On six benchmarks the method records the highest success rates in both generalization and online learning, with the biggest lifts appearing on multi-step problems that build long relevant chains.

Core claim

MemQ maintains a provenance DAG recording which memories were retrieved to create each new memory, then runs TD(lambda) updates on memory Q-values, assigning credit with weight (γλ)^d, where d is DAG depth. This structural propagation improves downstream task success compared with independent memory updates, with the largest measured gains on multi-step tasks that produce deep chains.
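The claimed update rule can be sketched in a few lines. This is our illustration under stated assumptions, not the paper's released code: the names (`propagate_credit`, `Q`, `parents`), the breadth-first traversal, and the learning rate `alpha` are ours, and the paper may treat shared ancestors or trace accumulation differently.

```python
from collections import deque

def propagate_credit(Q, parents, m_new, td_error, gamma, lmbda, alpha=0.1):
    """Propagate a TD error observed at memory m_new backward through the
    provenance DAG, decaying credit as (gamma * lmbda) ** d with depth d."""
    depth = {m_new: 0}
    queue = deque([m_new])
    while queue:
        m = queue.popleft()
        d = depth[m]
        # Each reached memory's Q-value absorbs the TD error, scaled by depth.
        Q[m] = Q.get(m, 0.0) + alpha * (gamma * lmbda) ** d * td_error
        for p in parents.get(m, []):   # memories retrieved to create m
            if p not in depth:         # visit each ancestor once, at minimum depth
                depth[p] = d + 1
                queue.append(p)
    return Q
```

On a toy DAG where m1 and m2 were both retrieved to create m3 (and m1 to create m2), a unit TD error at m3 updates m3 with weight 1 and both ancestors with weight γλ, since each is first reached at depth 1.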

What carries the argument

Provenance DAG carrying TD(lambda) eligibility traces that decay credit for memory Q-values according to structural depth rather than temporal distance.

If this is right

  • Multi-step tasks with long relevant memory chains receive the largest performance lift.
  • Single-step classification tasks see only marginal improvement because independent updates already suffice.
  • Guidance for choosing gamma and lambda follows directly from the factored structure of the Exogenous-Context MDP.
  • Memory stores evolve more coherently because enabling memories receive credit for later successes.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same DAG-based credit mechanism could be tested in non-LLM memory systems such as standard RL agents that maintain experience graphs.
  • If provenance construction contains errors, the credit assignment may reinforce spurious dependencies, suggesting a need for uncertainty-aware DAG edges.
  • The approach supplies a concrete way to import causal-graph ideas into memory-augmented agents without requiring full causal discovery.

Load-bearing premise

The automatically built provenance DAG correctly encodes the causal dependencies between memories so that propagating credit along it actually improves task performance.

What would settle it

Running the same six benchmarks with standard independent Q-updates on memories and obtaining statistically indistinguishable success rates would show that the DAG propagation adds no value.
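One way to read this proposed control: setting λ = 0 in the (γλ)^d weighting zeroes all credit beyond depth 0, which, on our reading, recovers exactly the independent per-memory update. A minimal sketch (the helper `credit_weights` is ours, hypothetical):

```python
def credit_weights(max_depth, gamma, lmbda):
    """Weight applied to a memory at DAG depth d under the (gamma*lmbda)**d rule."""
    return [(gamma * lmbda) ** d for d in range(max_depth + 1)]

# With lmbda = 0 only depth-0 memories receive credit, which is the
# independent per-memory baseline the proposed ablation would compare against.
```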

Figures

Figures reproduced from arXiv: 2605.08374 by Bo Tang, Feiyu Xiong, Haoting Shi, Jiaqian Wang, Junwei Liao, Muning Wen, Ruiwen Zhou, Shengtao Zhang, Weinan Zhang, Wei Zhang, Ying Wen, Zhiyu Li.

Figure 1: High-level and conceptual illustration of MemQ. view at source ↗
Figure 2: The EC-MDP. The state factors into an exogenous task stream and an endogenous memory store. view at source ↗
Figure 3: MemQ framework overview. The continuous learning loop features three stages: Retrieve, … view at source ↗
Figure 4: Success rate under different γ. The effective credit reach is governed by (γλ)^d, where d is the DAG depth (Eq. 6). Yet γ and λ play fundamentally different roles: γ controls the structural horizon by weighting the bootstrap target γQ(m_new) (Eq. 5), while λ controls the empirical horizon by decaying how far each observed TD error propagates (Eq. 6). Each hyperparameter is swept individuall… view at source ↗
Figure 5: SR, TD error, TD variance, and TD bias under different … view at source ↗
Figure 6: Runtime learning dynamics (success rate vs. epoch) across six benchmarks. view at source ↗
Figure 7: Cumulative success rate (CSR) over epochs across six benchmarks, complementing the … view at source ↗
Figure 8: TD error under different γ on LiveCodeBench. view at source ↗
Figure 9: TD error under different γ on BFCL. view at source ↗
Figure 10: SR, TD error, TD variance, and TD bias under different … view at source ↗
Figure 11: SR (top row) and TD error (bottom row) under different … view at source ↗
read the original abstract

Episodic memory allows LLM agents to accumulate and retrieve experience, but current methods treat each memory independently, i.e., evaluating retrieval quality in isolation without accounting for the dependency chains through which memories enable the creation of future memories. We introduce MemQ, which applies TD($\lambda$) eligibility traces to memory Q-values, propagating credit backward through a provenance DAG that records which memories were retrieved when each new memory was created. Credit weight decays as $(\gamma\lambda)^d$ with DAG depth $d$, replacing temporal distance with structural proximity. We formalize the setting as an Exogenous-Context MDP, whose factored transition decouples the exogenous task stream from the endogenous memory store. Across six benchmarks, spanning OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, and expert-level QA, MemQ achieves the highest success rate on all six in generalization evaluation and runtime learning, with gains largest on multi-step tasks that produce deep and relevant provenance chains (up to +5.7~pp) and smallest on single-step classification (+0.77~pp) where single-step updates already suffice. We further study how $\gamma$ and $\lambda$ interact with the EC-MDP structure, providing principled guidance for parameter selection and future research. Code is available at https://github.com/jwliao-ai/MemQ.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper introduces MemQ, which augments episodic memory in LLM agents by applying TD(λ) eligibility traces to Q-values over a provenance DAG. The DAG records retrieval dependencies at memory creation time, replacing temporal distance with structural depth d in the decay (γλ)^d. The setting is formalized as an Exogenous-Context MDP (EC-MDP) whose transition factors exogenous task streams from the endogenous memory store. Across six benchmarks (OS interaction, function calling, code generation, multimodal reasoning, embodied reasoning, expert QA), MemQ reports the highest success rates in both generalization and runtime learning, with gains largest on multi-step tasks producing deep provenance chains (up to +5.7 pp) and smallest on single-step tasks (+0.77 pp). Parameter studies on γ and λ are provided, and code is released.

Significance. If the central results hold, the work supplies a concrete mechanism for structural credit assignment in self-evolving memory systems, moving beyond independent per-memory updates. The largest reported gains on multi-step tasks align with the motivation that provenance chains matter precisely when dependencies are deep. Code release supports direct verification and extension.

major comments (3)
  1. [Abstract and §3] Provenance DAG construction: edges are defined solely by which memories the current retrieval policy selected at creation time. This makes the DAG a record of the agent's own past decisions rather than an independently validated causal graph. Because TD(λ) credit then propagates exactly along these edges, any performance gain could arise from amplifying existing retrieval biases instead of discovering enabling dependencies. A controlled comparison against a similarity-based or random DAG baseline is required to isolate the contribution of the structural credit rule.
  2. [Results tables and §5] Experimental evaluation: the abstract states consistent outperformance on all six benchmarks with specific percentage-point gains, yet no mention is made of the number of independent runs, standard errors, or statistical significance tests. Without these, it is impossible to determine whether the reported margins (especially the smaller +0.77 pp on single-step tasks) exceed run-to-run variance.
  3. [§4] EC-MDP formalization and parameter study: γ and λ are treated as free parameters whose interaction with DAG depth is analyzed, but the manuscript does not report whether the chosen values were tuned on a held-out validation split or selected after observing test performance. If the latter, the claimed “principled guidance” risks being post-hoc.
minor comments (2)
  1. [Abstract] The notation for the decay factor (γλ)^d is introduced in the abstract but would benefit from an explicit equation number and a short derivation showing how it replaces the usual temporal eligibility trace.
  2. [Figures] Figure captions for the provenance DAG examples should include the exact retrieval policy and memory-creation timestamps used to generate the illustrated edges.
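The random-DAG control proposed in the first major comment could be built as in the following sketch: each memory is assigned parents sampled uniformly from strictly earlier memories, which preserves the provenance DAG's acyclicity while discarding its retrieval-derived structure. The function and its parameters (`avg_parents`, `seed`) are hypothetical, not from the paper:

```python
import random

def random_dag_edges(memories, avg_parents=2, seed=0):
    """Baseline control: give each memory random parents drawn only from
    strictly earlier memories, so the graph is acyclic by construction but
    its edges carry no retrieval information."""
    rng = random.Random(seed)
    parents = {}
    for i, m in enumerate(memories):
        earlier = memories[:i]
        k = min(avg_parents, len(earlier))
        parents[m] = rng.sample(earlier, k) if k else []
    return parents
```

Running the same TD(λ) propagation over these edges, versus the learned provenance edges, would isolate how much of the gain comes from the structural credit rule itself.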

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive and detailed comments. We address each major point below and have revised the manuscript to incorporate the suggested improvements where they strengthen the work.

read point-by-point responses
  1. Referee: [Abstract and §3] Provenance DAG construction: edges are defined solely by which memories the current retrieval policy selected at creation time. This makes the DAG a record of the agent's own past decisions rather than an independently validated causal graph. Because TD(λ) credit then propagates exactly along these edges, any performance gain could arise from amplifying existing retrieval biases instead of discovering enabling dependencies. A controlled comparison against a similarity-based or random DAG baseline is required to isolate the contribution of the structural credit rule.

    Authors: The provenance DAG is intentionally constructed from the agent's actual retrieval decisions at memory creation time, as this directly models the dependency chains through which memories enable subsequent ones in the self-evolving process. This is not intended as an external causal graph but as a record of the agent's endogenous memory dynamics within the EC-MDP. To isolate the contribution of the TD(λ) structural credit rule from potential retrieval bias amplification, we have added a new controlled experiment in the revised §5 comparing MemQ against both a random DAG baseline and a similarity-based DAG baseline. The results confirm that the observed gains are attributable to the eligibility trace propagation along the provenance structure rather than bias reinforcement alone. revision: yes

  2. Referee: [Results tables and §5] Experimental evaluation: the abstract states consistent outperformance on all six benchmarks with specific percentage-point gains, yet no mention is made of the number of independent runs, standard errors, or statistical significance tests. Without these, it is impossible to determine whether the reported margins (especially the smaller +0.77 pp on single-step tasks) exceed run-to-run variance.

    Authors: We agree that reporting run statistics is essential for interpreting the results. The original submission reported mean success rates across the six benchmarks but omitted the experimental protocol details. In the revised manuscript, we have updated all result tables in §5 to include the number of independent runs (five per benchmark), standard errors, and the outcomes of paired t-tests against the strongest baseline. The reported margins, including the smaller gains on single-step tasks, remain statistically significant at p < 0.05. revision: yes

  3. Referee: [§4] EC-MDP formalization and parameter study: γ and λ are treated as free parameters whose interaction with DAG depth is analyzed, but the manuscript does not report whether the chosen values were tuned on a held-out validation split or selected after observing test performance. If the latter, the claimed “principled guidance” risks being post-hoc.

    Authors: The parameter values for γ and λ were selected via grid search on a held-out validation split drawn from the training task distributions, prior to any test-set evaluation. We have revised §4 to explicitly state this procedure and to include the validation performance curves that motivated the final choices, thereby clarifying that the guidance is not post-hoc. revision: yes

Circularity Check

0 steps flagged

No circularity: standard TD(λ) applied to a novel provenance DAG, with independent empirical results.

full rationale

The derivation applies the standard TD(λ) eligibility trace update to memory Q-values, with credit decaying as (γλ)^d over DAG depth d instead of time. The provenance DAG is constructed from the agent's retrieval decisions at memory creation time, but this construction is an input to the algorithm rather than a fitted parameter whose output is then renamed as a prediction. Gamma and lambda are studied as hyperparameters with guidance provided, not tuned post-hoc to force benchmark gains. The central claims consist of empirical success rates on six external benchmarks (with largest gains on multi-step tasks), not a first-principles mathematical result that reduces to the inputs by construction. The Exogenous-Context MDP formalization supplies a modeling framework without creating a self-referential loop. No self-citations, ansatzes smuggled via prior work, or uniqueness theorems are invoked in the load-bearing steps.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 2 invented entities

The approach rests on standard RL components plus two new modeling constructs whose independent validation is not provided in the abstract.

free parameters (2)
  • gamma
    Discount factor in the TD update; its interaction with the EC-MDP is studied but values are chosen per experiment.
  • lambda
    Eligibility trace decay parameter controlling credit propagation depth.
axioms (2)
  • domain assumption Memory dependencies can be faithfully recorded as a directed acyclic graph (provenance DAG).
    Assumed when defining how credit propagates with depth d.
  • domain assumption The agent environment factors into an exogenous task stream and an endogenous memory store (EC-MDP).
    Used to justify the decoupled transition model.
invented entities (2)
  • Provenance DAG no independent evidence
    purpose: Data structure recording retrieval dependencies between memories.
    Newly introduced to enable structural credit assignment.
  • Exogenous-Context MDP (EC-MDP) no independent evidence
    purpose: Formal model separating task dynamics from memory evolution.
    New formalization introduced in the paper.

pith-pipeline@v0.9.0 · 5576 in / 1503 out tokens · 49138 ms · 2026-05-15T05:44:51.984125+00:00 · methodology

