Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks

Jiangming Shu; Jitao Sang; Shangxi Wu; Xueyuan Lin; Ye Ma; Yuxiang Zhang

arxiv: 2510.12635 · v3 · submitted 2025-10-14 · 💻 cs.AI

Yuxiang Zhang, Jiangming Shu, Ye Ma, Xueyuan Lin, Shangxi Wu, Jitao Sang

Pith reviewed 2026-05-18 07:14 UTC · model grok-4.3

classification 💻 cs.AI

keywords: memory management · long-context LLMs · reinforcement learning · context curation · agentic tasks · working memory · policy optimization · in-place editing

The pith

Framing memory management as reinforcement-learned editing actions lets a 14B model match the accuracy of models 16 times larger while halving average context length.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that long-context language models struggle with attention dilution in extended tasks because external memory mechanisms ignore the agent's internal reasoning state. By recasting context curation as in-place editing operations—deletion and insertion—that the model learns to perform as policy actions, MemAct turns memory management into an end-to-end optimizable process via reinforcement learning. Dynamic Context Policy Optimization is introduced to keep this training efficient and stable. Experiments show the resulting 14B model reaches parity with much larger systems on long-horizon agentic tasks while using 51 percent less context on average, and the learned policies adapt to different model sizes and task difficulties.

Core claim

Memory management can be formulated as a set of learnable in-place editing actions whose policy is jointly optimized with task performance through end-to-end reinforcement learning; the resulting agent autonomously curates its working memory, preserving reasoning integrity while shrinking context size.

What carries the argument

Memory-as-Action (MemAct) framework that treats context updates as in-place editing operations (deletion, insertion) optimized by Dynamic Context Policy Optimization to enable stable RL training.
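
As a rough illustration of what in-place editing of working memory could look like, the sketch below models the context as an ordered list of text segments with two actions, deletion and insertion. The class name `WorkingMemory`, its methods, and the whitespace-based length proxy are assumptions made for this example; the abstract does not specify the paper's actual interface.

```python
from dataclasses import dataclass, field


@dataclass
class WorkingMemory:
    """Toy working memory: an ordered list of text segments (hypothetical structure)."""
    segments: list = field(default_factory=list)

    def append(self, text: str) -> int:
        """Add a new observation or thought and return its index."""
        self.segments.append(text)
        return len(self.segments) - 1

    def delete(self, index: int) -> None:
        """Deletion action: remove one segment in place."""
        del self.segments[index]

    def insert(self, index: int, text: str) -> None:
        """Insertion action: place new text (for example a short note) at a position."""
        self.segments.insert(index, text)

    def token_count(self) -> int:
        """Crude length proxy: whitespace-separated words (an assumption, not a tokenizer)."""
        return sum(len(s.split()) for s in self.segments)


memory = WorkingMemory()
memory.append("Task: find the release year of the library.")
memory.append("Search result: a long page of mostly irrelevant text about other topics.")
memory.append("Search result: the library was first released in 2015.")

before = memory.token_count()
memory.delete(1)                                        # drop the unhelpful page
memory.insert(1, "Note: first search was not useful.")  # keep a short trace of it
after = memory.token_count()
print(before, after)
```

In the framing described above, the choice of which segment to delete and what to insert would be made by the learned policy; here both are hard-coded purely to show the mechanics.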

If this is right

A 14B model reaches accuracy parity with models 16 times larger (roughly 224B parameters, taking 14B × 16) on the evaluated agentic tasks.
Average context length drops by 51 percent while task performance is maintained.
Learned editing strategies automatically adjust to the underlying model's capacity.
The same policy generalizes from simpler to more complex task variants.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Training compute for long-horizon agents could fall because smaller base models become viable once they learn to prune their own context.
The editing-action formulation might extend to other persistent state problems such as tool-use histories or multi-turn planning buffers.
If the policy remains stable at even longer horizons, it could reduce reliance on fixed-length context windows in deployed agent systems.

Load-bearing premise

That reinforcement learning on in-place memory edits can be made stable and efficient without creating new reasoning failures or requiring rewards that fail to generalize across tasks.

What would settle it

A controlled run in which the same RL procedure on editing actions produces lower task accuracy or longer contexts than a strong external-memory baseline on the same long-horizon benchmarks.

Original abstract

Long-context Large Language Models, despite their expanded capacity, require careful working memory management to mitigate attention dilution during long-horizon tasks. Yet existing approaches rely on external mechanisms that lack awareness of the agent's reasoning state, leading to suboptimal decisions. We propose Memory-as-Action (MemAct), a framework that treats working memory management as learnable policy actions. By formulating context management as in-place editing operations (deletion, insertion), MemAct enables joint optimization of information retention and task performance through end-to-end reinforcement learning. To address the computational challenges of dynamic context updates, we introduce Dynamic Context Policy Optimization, which restores training efficiency without compromising reasoning integrity. Experiments show that MemAct-RL-14B matches the accuracy of models $16\times$ larger while reducing average context length by 51\%, with learned strategies that adapt to model capabilities and generalize across task complexities.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

Private letter to a colleague.

This paper frames context curation as RL actions for in-place edits and shows a 14B model matching much larger ones with 51% shorter contexts, though RL details remain light.

The main thing here is that the authors turn memory management into explicit RL actions (deletion and insertion) that get optimized jointly with task performance instead of relying on external heuristics or fixed rules. This lets the policy learn retention based on the agent's current reasoning state rather than some separate module. They add Dynamic Context Policy Optimization to keep training efficient when context lengths change on the fly. The reported outcome is concrete: their 14B model holds accuracy against models sixteen times larger while cutting average context length by half, and the strategies seem to shift with model size and task complexity. That is the part worth noting first. The framing moves beyond prior summarization or retrieval methods by making the edits themselves learnable and end-to-end with the reward.

On the soft spots, the description of the reward design, baseline choices, and ablations for the policy optimization step is still thin. Long-horizon RL often runs into credit assignment problems, and if the reward does not properly weight future information needs, the policy could favor aggressive early deletions that look good short-term but hurt later steps. The stress-test concern about myopic behavior or policy collapse does not get fully ruled out by the high-level claims, so that remains a point to check.

This is the sort of work that matters for people building practical long-horizon agents who want to avoid external memory crutches and keep everything inside the model. Readers working on RL for LLMs or efficient context use would find the empirical angle useful. It has a distinct enough idea and specific enough numbers to deserve a serious referee, even if the RL implementation needs more scrutiny. I would send it out for peer review and ask the authors to expand on the reward formulation and any checks for long-term coherence.

Referee Report

3 major / 2 minor

Summary. The manuscript introduces the Memory-as-Action (MemAct) framework, which reformulates working-memory management for long-horizon agentic tasks as a reinforcement-learning policy over in-place editing actions (deletion and insertion). It further proposes Dynamic Context Policy Optimization to make end-to-end RL tractable on variable-length contexts. The central empirical claim is that the resulting MemAct-RL-14B model matches the accuracy of models 16× larger while reducing average context length by 51 %, with learned strategies that adapt to model capability and generalize across task complexities.

Significance. If the empirical results are shown to be robust, the work would constitute a meaningful contribution to efficient long-context agent design by demonstrating that joint RL optimization of retention and task reward can produce adaptive, model-aware context policies. The framing of memory management as explicit actions is conceptually clean and could be extended to other resource-constrained agent settings.

major comments (3)

Abstract and §4 (Experiments): the reported accuracy parity with 16× larger models and the 51% context reduction are stated without any description of the reward function, the precise baselines, statistical significance testing, or ablations isolating Dynamic Context Policy Optimization; these omissions prevent evaluation of the central performance claim.
§3.2 (Dynamic Context Policy Optimization): the method is described at a high level; a concrete account of how gradients are approximated across variable-length contexts and how the composite reward balances short-term accuracy against long-term information needs is required to assess whether the policy can avoid myopic deletions or credit-assignment collapse in long-horizon trajectories.
§4 (Experiments): no analysis is provided of whether the learned editing policy introduces new reasoning failures (e.g., premature deletion of information required only at later steps) or whether performance gains hold under different task lengths and model scales beyond the single 14B checkpoint reported.
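
The third comment asks for evidence about premature deletions. One way such an audit could be set up is sketched below: given a log of when each item was deleted and when it was later needed, flag the items that were removed too early. The log format and the function name `premature_deletions` are hypothetical, and deciding when an item is "needed" would itself require annotation that this sketch takes as given.

```python
def premature_deletions(deleted_at, needed_at):
    """Return ids of items that were deleted before a step at which they were needed.

    deleted_at: dict mapping item id -> step at which the item was deleted
    needed_at:  dict mapping item id -> list of steps at which the item was needed
    Both log formats are hypothetical and chosen only for illustration.
    """
    flagged = []
    for item, deletion_step in deleted_at.items():
        # An item counts as prematurely deleted if any need arises after its deletion.
        if any(step > deletion_step for step in needed_at.get(item, [])):
            flagged.append(item)
    return sorted(flagged)


# Toy trajectory log, invented for this example.
deleted_at = {"obs_1": 3, "obs_2": 5, "obs_3": 6}
needed_at = {"obs_1": [2, 7], "obs_2": [4]}

flagged = premature_deletions(deleted_at, needed_at)
rate = len(flagged) / len(deleted_at)
print(flagged, rate)
```

Reporting the flagged rate alongside final accuracy would show whether early deletions actually cost the agent anything.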

minor comments (2)

Notation: the distinction between the editing policy and the underlying LLM policy should be made explicit in every equation and algorithm box.
Figure captions: add error bars or confidence intervals to the context-length reduction plots so that the 51% average can be interpreted with statistical context.
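
For the confidence intervals requested in the second minor comment, a minimal sketch using only the Python standard library follows. The per-run values are invented placeholders, not results from the paper, and the interval uses a normal approximation; with only a handful of runs, a t-based interval would be wider.

```python
import math
import statistics


def mean_confidence_interval(values, level=0.95):
    """Return (mean, low, high) for the mean of `values`.

    Uses a normal approximation to the sampling distribution of the mean,
    which is only a rough guide when there are few runs.
    """
    n = len(values)
    if n < 2:
        raise ValueError("need at least two values")
    mean = statistics.fmean(values)
    sem = statistics.stdev(values) / math.sqrt(n)  # standard error of the mean
    z = statistics.NormalDist().inv_cdf(0.5 + level / 2)
    return mean, mean - z * sem, mean + z * sem


# Placeholder per-run context-length reductions (fractions), for illustration only.
reductions = [0.40, 0.60, 0.50, 0.45, 0.55]
mean, low, high = mean_confidence_interval(reductions)
print(f"mean reduction {mean:.3f}, 95% CI [{low:.3f}, {high:.3f}]")
```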

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the MemAct framework's conceptual contribution. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and analyses.

Referee: Abstract and §4 (Experiments): the reported accuracy parity with 16× larger models and the 51% context reduction are stated without any description of the reward function, the precise baselines, statistical significance testing, or ablations isolating Dynamic Context Policy Optimization; these omissions prevent evaluation of the central performance claim.

Authors: We agree that additional detail is needed for proper evaluation. In the revised manuscript, we will expand the abstract to briefly describe the reward function (task accuracy reward combined with a context-length penalty) and key baselines. Section 4 will be updated with: (i) the exact reward formulation, (ii) a complete list of baselines including model sizes and prompting methods, (iii) statistical significance results using paired t-tests across multiple random seeds, and (iv) ablations that isolate Dynamic Context Policy Optimization by comparing against ablated variants. These changes will directly address the evaluation concerns.
revision: yes

Referee: §3.2 (Dynamic Context Policy Optimization): the method is described at a high level; a concrete account of how gradients are approximated across variable-length contexts and how the composite reward balances short-term accuracy against long-term information needs is required to assess whether the policy can avoid myopic deletions or credit-assignment collapse in long-horizon trajectories.

Authors: We acknowledge the high-level presentation in §3.2. The revision will add a concrete technical description: gradients are approximated via REINFORCE with a learned baseline and importance sampling to accommodate variable-length contexts after each edit action. The composite reward will be specified as a weighted sum of immediate task reward and a discounted long-term retention term that assigns value to information needed in future steps. We will also discuss how this formulation, together with eligibility traces, reduces the risk of myopic deletions and credit-assignment collapse.
revision: yes

Referee: §4 (Experiments): no analysis is provided of whether the learned editing policy introduces new reasoning failures (e.g., premature deletion of information required only at later steps) or whether performance gains hold under different task lengths and model scales beyond the single 14B checkpoint reported.

Authors: This is a fair observation. We will add a dedicated error-analysis subsection in §4 that examines trajectories for premature deletions, quantifies their frequency and impact on final accuracy, and reports results across a wider range of task lengths. We will also include performance numbers for a 7B-scale variant to illustrate behavior at different model scales. Resource constraints prevented exhaustive testing at additional scales in the original submission, but the added experiments will strengthen the robustness claims.
revision: partial
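
The simulated responses mention a task-accuracy reward combined with a context-length penalty, and REINFORCE with a baseline. The sketch below shows one simple form such a reward could take, plus baseline-subtracted advantages. Everything here is an assumption for illustration: the weight `length_weight`, the normalization by `max_context_tokens`, and the batch-mean baseline (a stand-in, not a learned baseline) are not taken from the paper or the rebuttal.

```python
import statistics


def composite_reward(task_correct, avg_context_tokens, max_context_tokens,
                     length_weight=0.1):
    """Task reward minus a context-length penalty (hypothetical formulation).

    The penalty grows linearly with the fraction of the context budget used
    and is capped once the budget is reached.
    """
    task_reward = 1.0 if task_correct else 0.0
    used_fraction = min(avg_context_tokens / max_context_tokens, 1.0)
    return task_reward - length_weight * used_fraction


def mean_baseline_advantages(rewards):
    """Subtract the batch-mean reward from each reward.

    This is a simple variance-reduction baseline used as a stand-in;
    a learned baseline would replace the batch mean with a predicted value.
    """
    baseline = statistics.fmean(rewards)
    return [r - baseline for r in rewards]


# Toy batch of three rollouts with invented outcomes.
rewards = [
    composite_reward(True, 4000, 8000),   # correct, half the budget used
    composite_reward(True, 8000, 8000),   # correct, full budget used
    composite_reward(False, 2000, 8000),  # incorrect, short context
]
advantages = mean_baseline_advantages(rewards)
print(rewards, advantages)
```

With these toy numbers the correct rollout that used less context gets the largest advantage, which is the behavior a length penalty is meant to encourage. Whether a penalty of this kind avoids myopic deletions is exactly what the referee asks the authors to show.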

Circularity Check

0 steps flagged

No significant circularity; empirical RL results are independent of inputs

The paper proposes the MemAct framework by formulating context curation as in-place editing actions optimized via end-to-end RL, then introduces Dynamic Context Policy Optimization to address training efficiency. The central claims (matching accuracy of 16x larger models while cutting context length 51%, with adaptive learned strategies) are presented as outcomes of experimental training and evaluation rather than any closed-form derivation or first-principles reduction. No equations or steps in the abstract or description equate a 'prediction' to a fitted parameter by construction, invoke self-citations as load-bearing uniqueness theorems, or smuggle ansatzes; the RL loop and policy are external mechanisms whose stability is an empirical question, not a definitional tautology. The work is therefore self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 2 invented entities

The central claim rests on the unstated premise that an RL policy can discover memory strategies that are both task-effective and computationally tractable; no external benchmarks or formal guarantees are supplied in the abstract.

invented entities (2)

Memory-as-Action (MemAct) framework · no independent evidence
purpose: Treats working memory management as learnable policy actions
New construct introduced to enable joint optimization of retention and performance

Dynamic Context Policy Optimization · no independent evidence
purpose: Restores training efficiency for dynamic context updates
Invented training procedure to address computational challenges of in-place edits

pith-pipeline@v0.9.0 · 5690 in / 1211 out tokens · 25322 ms · 2026-05-18T07:14:43.987590+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

Relation between the paper passage and the cited Recognition theorem.

memory actions can overwrite or remove earlier content from the working memory, thereby breaking the causal continuity of the trajectory... Dynamic Context Policy Optimization... segmenting trajectories at memory action points
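
The quoted passage mentions segmenting trajectories at memory action points. A minimal sketch of that idea follows, assuming a trajectory is a list of step labels and that each memory action closes the segment it belongs to. Both the step representation and the choice to end a segment at the memory action rather than start one there are assumptions for illustration; the passage does not say how the paper handles the boundary.

```python
def segment_at_memory_actions(steps, is_memory_action):
    """Split a trajectory into segments, closing a segment after each memory action.

    Within a segment the context only grows, so earlier steps remain a valid
    prefix for later ones; a memory action edits the context and breaks that
    continuity, which is why it marks a boundary here.
    """
    segments, current = [], []
    for step in steps:
        current.append(step)
        if is_memory_action(step):
            segments.append(current)
            current = []
    if current:  # trailing steps after the last memory action
        segments.append(current)
    return segments


# Toy trajectory with hypothetical step labels.
trajectory = ["think", "tool_call", "memory_edit", "think", "memory_edit", "answer"]
segments = segment_at_memory_actions(trajectory, lambda s: s == "memory_edit")
print(segments)
```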

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 9 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
cs.CL · 2026-05 · unverdicted · novelty 8.0
Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.

Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
cs.AI · 2026-05 · unverdicted · novelty 7.0
Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.

Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
cs.CL · 2026-05 · unverdicted · novelty 7.0
MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.

ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
cs.AI · 2026-04 · unverdicted · novelty 7.0
ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.

PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
cs.AI · 2026-03 · unverdicted · novelty 7.0
PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.

Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
cs.CR · 2026-05 · unverdicted · novelty 6.0
Policy directives can be lost during context assembly in language model agents, leading to unprompted policy violations that SafeContext can partially prevent.

MemFactory: Unified Inference & Training Framework for Agent Memory
cs.CL · 2026-03 · unverdicted · novelty 6.0
MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.

The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
cs.AI · 2025-09 · accept · novelty 6.0
Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.

Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
cs.CR · 2026-05 · unverdicted · novelty 5.0
The paper measures policy-carriage failures during LLM context assembly and evaluates SafeContext as a partial mitigation on Llama, Qwen, and Mistral models.