The Past Is Prologue: A Plug-in Controller for Selective Updates in Sequentially Evolving LLM Memory

Chengshuai Shi; Cong Shen; Jundong Li; Peng Wang; Song Wang; Songwei Dong; Zihan Chen

arxiv: 2606.31121 · v1 · pith:4DYCH62Onew · submitted 2026-06-30 · 💻 cs.AI

Pith reviewed 2026-07-01 05:57 UTC · model grok-4.3

classification 💻 cs.AI

keywords LLM memory · sequential updates · memory controller · plug-in controller · selective updates · memory momentum trigger · hybrid evaluation set

The pith

Janus is a plug-in controller that decides whether to accept or reject each LLM memory update by flagging suspicious trajectory deviations and testing on a compact hybrid set of coverage, boundary, and fresh tasks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Sequentially evolving LLM memory lets agents reuse past experience, yet most systems apply every locally generated update without checking its effect on future behavior. This can overwrite useful knowledge, add over-specific rules, or bias memory toward recent examples. Janus wraps any existing updater without changing its rules and uses a Memory Momentum Trigger to spot problematic changes, then compares the old and new memory only on a small hybrid evaluation set rather than replaying full history. Across six datasets, two backbone LLMs, and two memory updaters, the controller raises average accuracy by 2.7 to 4.6 points. The result is more reliable long-term memory at low extra cost.

Core claim

Janus is a method-agnostic plug-in that decides whether to accept a candidate memory update or retain the previous memory. It identifies suspicious deviations in the update trajectory with the Memory Momentum Trigger and compares the two memories on a compact hybrid evaluation set of coverage, boundary, and fresh tasks instead of replaying the full history, producing consistent accuracy gains of 2.7 to 4.6 points over the base updaters.
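The accept-or-retain decision described above can be sketched as a thin wrapper around any base updater. This is a minimal illustration, not the authors' implementation: the function name `controlled_step`, the three callables, and the rule of accepting on ties are all assumptions made for the example.

```python
from typing import Callable, List

Memory = List[str]  # assumption: memory is a list of textual rules


def controlled_step(
    memory: Memory,
    task: str,
    base_update: Callable[[Memory, str], Memory],
    is_suspicious: Callable[[Memory, Memory], bool],
    score: Callable[[Memory], float],
) -> Memory:
    """Return the memory to deploy after processing one task."""
    # The base updater runs unchanged; the controller only wraps it.
    candidate = base_update(memory, task)
    # If the trigger stays silent, deploy the candidate without extra evaluation.
    if not is_suspicious(memory, candidate):
        return candidate
    # Trigger fired: compare old and new memory on a small evaluation set.
    # Accepting on ties is an arbitrary choice for this sketch.
    return candidate if score(candidate) >= score(memory) else memory


# Toy stand-ins (all hypothetical) so the sketch runs end to end.
def append_rule(mem: Memory, task: str) -> Memory:
    return mem + [task]


def flag_long(old: Memory, new: Memory) -> bool:
    return len(new[-1]) > 10


def penalize_long(mem: Memory) -> float:
    return -float(sum(len(rule) > 10 for rule in mem))


mem: Memory = []
for t in ["short", "a very long over-specific rule", "ok"]:
    mem = controlled_step(mem, t, append_rule, flag_long, penalize_long)
print(mem)  # ['short', 'ok']
```

The second task is flagged and scores worse than the existing memory, so the previous memory is retained; the other two updates pass through without any evaluation cost.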

What carries the argument

The Memory Momentum Trigger that detects suspicious deviations in the memory-update trajectory, paired with comparison of old and new memories on a compact hybrid evaluation set of coverage, boundary, and fresh tasks.
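The abstract does not say how the Memory Momentum Trigger is computed. One plausible reading, offered purely as an assumption, is that each memory snapshot is embedded as a vector, an exponential moving average of accepted update directions serves as the "momentum", and the trigger fires when a new update points away from it. The class name, `beta`, and `threshold` below are hypothetical.

```python
import math
from typing import List, Optional, Sequence


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(y * y for y in b))
    return 0.0 if na == 0.0 or nb == 0.0 else dot / (na * nb)


class MomentumTrigger:
    """Hypothetical trigger: flags updates that oppose the recent trajectory."""

    def __init__(self, beta: float = 0.8, threshold: float = 0.0) -> None:
        self.beta = beta            # assumed smoothing factor for the moving average
        self.threshold = threshold  # assumed cosine cutoff below which we flag
        self.momentum: Optional[List[float]] = None

    def fires(self, old_vec: Sequence[float], new_vec: Sequence[float]) -> bool:
        """True if the proposed update deviates from the stored momentum."""
        if self.momentum is None:
            return False  # no trajectory yet, nothing to deviate from
        delta = [n - o for o, n in zip(old_vec, new_vec)]
        return cosine(delta, self.momentum) < self.threshold

    def commit(self, old_vec: Sequence[float], new_vec: Sequence[float]) -> None:
        """Fold an accepted update into the momentum."""
        delta = [n - o for o, n in zip(old_vec, new_vec)]
        if self.momentum is None:
            self.momentum = delta
        else:
            self.momentum = [
                self.beta * m + (1.0 - self.beta) * d
                for m, d in zip(self.momentum, delta)
            ]


trig = MomentumTrigger()
trig.commit([0.0, 0.0], [1.0, 0.0])                # first accepted update moves along +x
same_dir = trig.fires([1.0, 0.0], [2.0, 0.1])      # continues roughly along +x
reversal = trig.fires([1.0, 0.0], [0.0, 0.0])      # undoes the previous update
print(same_dir, reversal)  # False True
```

Keeping `fires` free of side effects and updating the momentum only through `commit` means rejected candidates do not shape the trajectory; that separation is a design choice of this sketch, not something the source states.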

If this is right

Existing memory updaters can be wrapped and improved without modifying their internal update rules.
Updates that aid the current task but harm future performance are rejected before they overwrite useful knowledge.
Full history replay is avoided while still guarding against over-specific rules or recent-example bias.
The same controller works across different backbone LLMs and different base memory updaters.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The selective-accept mechanism could extend to other sequential knowledge stores that agents maintain over long interactions.
Dynamic adjustment of the hybrid evaluation set size or composition might further reduce false rejections on novel task types.
The momentum trigger could be combined with uncertainty estimates from the base updater to refine rejection thresholds.

Load-bearing premise

That performance on the compact hybrid evaluation set reliably predicts whether an update will improve behavior on unseen future tasks without replaying the full history.
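One way to probe this premise, sketched under the assumption that both a hybrid-set score change and a held-out score change are logged for each evaluated update, is to measure how often the two agree in sign. The function name and the example numbers are made up for illustration.

```python
from typing import Sequence, Tuple


def proxy_agreement(deltas: Sequence[Tuple[float, float]]) -> float:
    """Fraction of updates where the proxy decision matches the held-out outcome.

    Each pair is (hybrid_delta, heldout_delta): the change in score from old
    to new memory on the hybrid set and on held-out tasks respectively.
    An update counts as "accept" when its delta is non-negative.
    """
    if not deltas:
        raise ValueError("need at least one evaluated update")
    agree = sum((h >= 0) == (t >= 0) for h, t in deltas)
    return agree / len(deltas)


# Illustrative numbers only, not results from the paper.
logged = [(0.05, 0.02), (-0.10, -0.03), (0.02, -0.04), (-0.01, 0.01)]
print(proxy_agreement(logged))  # 0.5
```

Agreement near 1.0 would support the premise; agreement near 0.5 would suggest the hybrid set is no better than a coin flip as a predictor of held-out behavior.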

What would settle it

On a new long sequence or dataset, always accepting updates produces higher accuracy on held-out tasks than the Janus-controlled updates.
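A sketch of that falsification check, assuming held-out accuracy is recorded for both policies on the same set of sequences: compute the paired gap and count how often always-accept wins. The function name and the accuracies are hypothetical.

```python
from statistics import mean
from typing import Sequence, Tuple


def paired_gap(
    controlled: Sequence[float], always_accept: Sequence[float]
) -> Tuple[float, int]:
    """Return (mean accuracy gap, number of sequences where always-accept wins)."""
    if len(controlled) != len(always_accept) or not controlled:
        raise ValueError("need two non-empty, equally long accuracy lists")
    diffs = [c - a for c, a in zip(controlled, always_accept)]
    return mean(diffs), sum(d < 0 for d in diffs)


# Illustrative held-out accuracies on three sequences, not from the paper.
gap, losses = paired_gap([0.62, 0.58, 0.60], [0.60, 0.59, 0.55])
print(round(gap, 3), losses)  # 0.02 1
```

A consistently negative gap on new sequences would be the outcome that undercuts the paper's claim.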

Figures

Figures reproduced from arXiv: 2606.31121 by Chengshuai Shi, Cong Shen, Jundong Li, Peng Wang, Song Wang, Songwei Dong, Zihan Chen.

**Figure 1.** Top: Existing sequential memory updates do not guarantee that the final memory will better support future tasks. Bottom: On GPQA, intermediate memory snapshots from two sequential memory methods exhibit non-monotonic test set performance, motivating the need to control which memory updates are deployed.

**Figure 2.** Overview of Janus. Given a task and the current memory $M_{t-1}$, a base updater proposes a candidate memory $M_t^c$. Janus acts as a plug-in controller that decides whether to deploy this candidate or retain the previous memory. It first uses a Memory Momentum Trigger to detect whether the candidate update deviates from the recent memory-update trajectory. If triggered, Janus compares $M_{t-1}$ and $M_t^c$ on a compact hyb…

**Figure 3.** Ablation study of the hybrid trigger-time evaluation set. We evaluate Janus with Qwen3-8B and the DC-RS updater on GPQA and HumanEval, reporting final test accuracy. Each ablation removes one component: coverage, boundary, or fresh tasks.

**Figure 4.** Old-vs-new deployment decision ablation. We evaluate memory checkpoints obtained after processing 20%, 40%, 60%, 80%, and 100% of the task stream using Qwen3-8B on GPQA and MMLU-Pro (Eng.). The base updater directly deploys every candidate memory, while Janus selectively accepts or rejects candidate updates.

**Figure 5.** Stream order analysis. We evaluate the final frozen-memory test accuracy under five different shuffled task-stream orders using Qwen3-8B on HumanEval. Each point corresponds to one stream order, and error bars show the variation across orders.
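A stream-order analysis like the one in Figure 5 can be set up as follows. This is a sketch under assumptions: the seeding scheme is arbitrary, and using the sample standard deviation for the error bars is a guess, since the caption only says "variation across orders".

```python
import random
from statistics import mean, stdev
from typing import List, Sequence, Tuple


def shuffled_orders(
    tasks: Sequence[int], n_orders: int = 5, base_seed: int = 0
) -> List[List[int]]:
    """Produce reproducible shuffles of the same task stream."""
    orders = []
    for i in range(n_orders):
        order = list(tasks)
        # A dedicated Random instance per order keeps each shuffle reproducible.
        random.Random(base_seed + i).shuffle(order)
        orders.append(order)
    return orders


def summarize(accuracies: Sequence[float]) -> Tuple[float, float]:
    """Mean and sample standard deviation of final accuracy across orders."""
    return mean(accuracies), stdev(accuracies)


orders = shuffled_orders(range(10))
# Illustrative accuracies, one per stream order; not values from the paper.
avg, spread = summarize([0.5, 0.6, 0.7])
print(len(orders), round(avg, 3), round(spread, 3))  # 5 0.6 0.1
```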

Original abstract

Sequentially evolving LLM memory enables agents to reuse past experience, but existing systems usually deploy each locally generated memory update without checking whether it improves future behavior. As a result, updates that help the current task may overwrite useful knowledge, introduce over-specific rules, or bias the final memory toward recent examples. We propose Janus, a plug-in memory controller that decides whether to accept a candidate memory update or retain the previous memory. To make this decision efficient, Janus uses a Memory Momentum Trigger to identify suspicious deviations in the memory-update trajectory, and compares old and new memories on a compact hybrid evaluation set of coverage, boundary, and fresh tasks instead of replaying the full history. Janus is method-agnostic and wraps existing updaters without changing their update rules. Across six datasets, two backbone LLMs, and two memory updaters, Janus improves average accuracy by +2.7 to +4.6 points over the corresponding base updaters.
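The abstract names the three components of the hybrid evaluation set but not how each is selected. The sketch below fills that gap with assumptions: coverage takes the first task seen in each category, boundary takes the tasks where the old memory had the smallest confidence margin, and fresh takes the most recent tasks. The `Record` fields and `build_hybrid_set` are hypothetical.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Record:
    task_id: str
    category: str
    margin: float  # assumed confidence margin of the old memory on this task


def build_hybrid_set(
    history: List[Record], n_boundary: int = 1, n_fresh: int = 1
) -> List[str]:
    """Pick a small, deduplicated set of task ids for old-vs-new comparison."""
    chosen: List[str] = []

    def add(record: Record) -> None:
        if record.task_id not in chosen:
            chosen.append(record.task_id)

    # Coverage (assumed): one representative per category, first seen wins.
    seen_categories = set()
    for record in history:
        if record.category not in seen_categories:
            seen_categories.add(record.category)
            add(record)

    # Boundary (assumed): tasks the old memory barely got right or wrong.
    for record in sorted(history, key=lambda r: abs(r.margin))[:n_boundary]:
        add(record)

    # Fresh (assumed): the most recent tasks in the stream.
    for record in history[max(0, len(history) - n_fresh):]:
        add(record)

    return chosen


history = [
    Record("t1", "math", 0.9),
    Record("t2", "math", 0.1),
    Record("t3", "code", 0.8),
    Record("t4", "code", -0.05),
    Record("t5", "math", 0.7),
    Record("t6", "code", 0.6),
    Record("t7", "math", 0.5),
]
print(build_hybrid_set(history))  # ['t1', 't3', 't4', 't7']
```

Four of seven tasks are selected here; on a long stream the set size depends on the number of categories and the two budgets rather than on history length, which is what makes the comparison cheaper than full replay.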

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

Janus is a plug-in controller that selectively accepts LLM memory updates via a momentum trigger and hybrid task set, with modest reported gains, but the evaluation shares the same datasets for decisions and accuracy measurement.

The core contribution is the wrapper that sits on top of existing memory updaters without changing their rules. It watches the update trajectory for momentum shifts to flag potential problems, then compares the old and new memory states on a small hybrid set of coverage, boundary, and fresh tasks. This replaces full history replay. The authors test it across six datasets, two backbone LLMs, and two different updaters, and report average accuracy lifts between 2.7 and 4.6 points.

That setup is practical for long-running agent systems where replay cost grows. The method-agnostic design is a clear plus if you already have an updater you like and just want a guardrail.
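A rough count shows why replay cost matters on long streams. The arithmetic assumes that full replay re-evaluates both the old and the new memory on every task seen so far at each step, and that the hybrid comparison runs only on triggered steps; the stream length, set size, and trigger count are invented for illustration.

```python
def full_replay_calls(num_updates: int) -> int:
    """Evaluations if both memories are replayed on all t seen tasks at step t."""
    # Sum over t = 1..T of 2 * t, which equals T * (T + 1).
    return num_updates * (num_updates + 1)


def hybrid_calls(num_triggered: int, hybrid_size: int) -> int:
    """Evaluations if both memories are scored on the hybrid set when triggered."""
    return 2 * hybrid_size * num_triggered


# Hypothetical stream: 1000 updates, 200 of them triggered, hybrid set of 12 tasks.
print(full_replay_calls(1000))  # 1001000
print(hybrid_calls(200, 12))    # 4800
```

Under these assumptions the replay cost grows quadratically with stream length, while the controller's cost grows with the number of triggered steps.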

The soft spot is the evaluation loop. The hybrid set drives both the accept/reject decision and the accuracy numbers that are reported. Because those numbers come from the same six datasets whose tasks populate the hybrid set, the gains could simply mean the controller is good at keeping updates that score well on its own metric. The abstract does not state whether final accuracy uses tasks strictly outside the ones seen during decision-making. If the full paper shows a clean separation or extra held-out tests, that would fix it; otherwise the central claim rests on a potentially circular proxy.

The paper stays empirical and does not lean on unstated derivations or free parameters. It addresses a real engineering pain point in sequential memory without overclaiming theory.

This is for people who build or tune memory in LLM agents and need something lightweight to reduce drift. A reader working on long-horizon systems would get a concrete controller to try. It has enough substance and clear thinking to go to peer review so the experimental protocol and any ablations can be checked directly.

Referee Report

2 major / 0 minor

Summary. The manuscript proposes Janus, a plug-in memory controller for sequentially evolving LLM memory. Janus decides whether to accept candidate memory updates or retain previous memory by using a Memory Momentum Trigger to identify suspicious deviations and comparing memories on a compact hybrid evaluation set of coverage, boundary, and fresh tasks. The approach is method-agnostic and is evaluated across six datasets, two backbone LLMs, and two memory updaters, claiming average accuracy improvements of +2.7 to +4.6 points over base updaters.

Significance. If the reported gains are shown to result from genuine improvements in generalization rather than circular evaluation, Janus could offer an efficient way to manage memory updates in LLM-based agents, reducing the risk of overwriting useful knowledge or introducing biases from recent examples. The plug-in nature allows integration with existing updaters without modification, which is a practical strength.

major comments (2)

The abstract claims empirical gains of +2.7 to +4.6 points but provides no information on experimental protocol, statistical tests, baseline details, or ablation studies, making it impossible to evaluate the soundness of the central performance claim.
The hybrid evaluation set is used both for the Memory Momentum Trigger's decisions and for measuring the accuracy improvements. This setup risks circular evaluation, where the controller may simply retain updates that perform well on the decision metric without ensuring better performance on truly unseen future tasks.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback. We address the two major comments point by point below, clarifying the experimental details and evaluation design while indicating planned revisions.

Referee: The abstract claims empirical gains of +2.7 to +4.6 points but provides no information on experimental protocol, statistical tests, baseline details, or ablation studies, making it impossible to evaluate the soundness of the central performance claim.

Authors: We agree the abstract is concise and omits key details. The full manuscript (Sections 4.1–4.3 and 5) specifies the six datasets, two backbone LLMs, two memory updaters, the protocol of multiple independent runs with reported means and standard deviations, and the ablation studies in Section 5.2. We will revise the abstract to include a short statement on the multi-dataset, multi-model evaluation to better support the performance claims. (revision: yes)

Referee: The hybrid evaluation set is used both for the Memory Momentum Trigger's decisions and for measuring the accuracy improvements. This setup risks circular evaluation, where the controller may simply retain updates that perform well on the decision metric without ensuring better performance on truly unseen future tasks.

Authors: The hybrid evaluation set (coverage, boundary, and fresh tasks) is used only inside the Memory Momentum Trigger for the accept/reject decision. Reported accuracy gains are measured on separate held-out test sets that are disjoint from the hybrid set; these test sets are never seen by the trigger. We will add an explicit statement in Section 3.2 and the experimental setup to make this separation unambiguous. (revision: yes)
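The separation claimed in this response is straightforward to check mechanically if task identifiers are available for both sets. A minimal sketch, with a hypothetical function name and made-up ids:

```python
from typing import Iterable, List


def leaked_tasks(decision_ids: Iterable[str], test_ids: Iterable[str]) -> List[str]:
    """Return task ids that appear in both the decision set and the test set."""
    return sorted(set(decision_ids) & set(test_ids))


# Hypothetical ids: the first pair is cleanly separated, the second is not.
print(leaked_tasks(["a", "b"], ["c", "d"]))  # []
print(leaked_tasks(["a", "b"], ["b", "c"]))  # ['b']
```

An empty result over every task ever placed in the hybrid set would back the rebuttal; any overlap would leave the referee's circularity concern standing.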

Circularity Check

0 steps flagged

No significant circularity; empirical results independent of inputs

The paper contains no equations, derivations, or first-principles claims that could reduce to their inputs by construction. The central result is an empirical accuracy improvement (+2.7 to +4.6 points) measured on six datasets after applying the Janus controller; this is presented as an experimental outcome rather than a quantity defined from fitted parameters or the hybrid evaluation set itself. No self-citation chains, uniqueness theorems, or ansatzes are invoked as load-bearing steps. The hybrid set is used for update decisions, but the reported gains are not shown to be statistically forced by that same metric, leaving the evaluation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the unverified assumption that the hybrid task set is representative and on the empirical observation of accuracy gains; no free parameters, axioms, or invented entities beyond the named components are stated in the abstract.

axioms (1)

Domain assumption: Performance on the compact hybrid evaluation set of coverage, boundary, and fresh tasks reliably indicates whether a memory update improves future behavior.
This premise allows the method to avoid full-history replay and is required for the decision rule to be valid.

invented entities (1)

Memory Momentum Trigger (no independent evidence)
purpose: Identify suspicious deviations in the memory-update trajectory
New component introduced to trigger selective evaluation; no independent evidence supplied in abstract.
