Memory as Action: Autonomous Context Curation for Long-Horizon Agentic Tasks
Pith reviewed 2026-05-18 07:14 UTC · model grok-4.3
The pith
Framing memory management as reinforcement-learned editing actions lets a 14B model match the accuracy of models 16 times larger while halving average context length.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Memory management can be formulated as a set of learnable in-place editing actions whose policy is jointly optimized with task performance through end-to-end reinforcement learning; the resulting agent autonomously curates its working memory, preserving reasoning integrity while shrinking context size.
What carries the argument
Memory-as-Action (MemAct) framework that treats context updates as in-place editing operations (deletion, insertion) optimized by Dynamic Context Policy Optimization to enable stable RL training.
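To make "in-place editing operations (deletion, insertion)" concrete, here is a minimal sketch of a working memory that supports those two edits. The `WorkingMemory` class, its method names, and the whitespace token count are assumptions made for illustration; they are not taken from the paper, and the edits are hard-coded where a learned policy would choose them.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class WorkingMemory:
    """Hypothetical working memory: an ordered list of text segments."""
    segments: List[str] = field(default_factory=list)

    def append(self, text: str) -> None:
        # Ordinary step: a new observation or thought is added at the end.
        self.segments.append(text)

    def delete(self, index: int) -> None:
        # Editing action: remove one segment in place.
        del self.segments[index]

    def insert(self, index: int, text: str) -> None:
        # Editing action: add a segment (e.g. a summary) at a chosen position.
        self.segments.insert(index, text)

    def length(self) -> int:
        # Crude proxy for context length: whitespace-separated token count.
        return sum(len(s.split()) for s in self.segments)


memory = WorkingMemory()
memory.append("search result A: long page about topic X ...")
memory.append("search result B: irrelevant page ...")
memory.append("thought: only result A matters")
before = memory.length()

# A learned policy would pick these edits; here they are hard-coded.
memory.delete(1)  # drop the irrelevant result
memory.delete(0)  # drop the raw page for result A
memory.insert(0, "summary: topic X, key fact from result A")
after = memory.length()
```

The point of the sketch is only that edits rewrite the context the model sees on its next step, so the context is no longer an append-only log.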
If this is right
- A 14B model reaches accuracy parity with 224B-scale models on the evaluated agentic tasks.
- Average context length drops by 51 percent while task performance is maintained.
- Learned editing strategies automatically adjust to the underlying model's capacity.
- The same policy generalizes from simpler to more complex task variants.
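The first two bullets rest on simple arithmetic, spelled out below. The baseline token count is a made-up number used only to show what a 51 percent reduction means; the paper's absolute context lengths are not given on this page.

```python
small_params_b = 14   # MemAct-RL-14B, in billions of parameters
scale_factor = 16     # "16 times larger" from the headline claim
large_params_b = small_params_b * scale_factor  # 14 * 16 = 224

reduction_percent = 51    # reported average context-length reduction
baseline_tokens = 10_000  # hypothetical baseline average, illustration only
memact_tokens = baseline_tokens * (100 - reduction_percent) // 100
```

With these illustrative inputs the curated context averages a little under half of the baseline, which is where "halving" in the headline comes from.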
Where Pith is reading between the lines
- Training compute for long-horizon agents could fall because smaller base models become viable once they learn to prune their own context.
- The editing-action formulation might extend to other persistent state problems such as tool-use histories or multi-turn planning buffers.
- If the policy remains stable at even longer horizons, it could reduce reliance on fixed-length context windows in deployed agent systems.
Load-bearing premise
That reinforcement learning on in-place memory edits can be made stable and efficient without creating new reasoning failures or requiring rewards that fail to generalize across tasks.
What would settle it
A controlled run in which the same RL procedure on editing actions produces lower task accuracy or longer contexts than a strong external-memory baseline on the same long-horizon benchmarks.
Original abstract
Long-context Large Language Models, despite their expanded capacity, require careful working memory management to mitigate attention dilution during long-horizon tasks. Yet existing approaches rely on external mechanisms that lack awareness of the agent's reasoning state, leading to suboptimal decisions. We propose Memory-as-Action (MemAct), a framework that treats working memory management as learnable policy actions. By formulating context management as in-place editing operations (deletion, insertion), MemAct enables joint optimization of information retention and task performance through end-to-end reinforcement learning. To address the computational challenges of dynamic context updates, we introduce Dynamic Context Policy Optimization, which restores training efficiency without compromising reasoning integrity. Experiments show that MemAct-RL-14B matches the accuracy of models 16× larger while reducing average context length by 51%, with learned strategies that adapt to model capabilities and generalize across task complexities.
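A paper passage quoted further down this page describes Dynamic Context Policy Optimization as "segmenting trajectories at memory action points", because an edit breaks the causal continuity of the trajectory. The sketch below shows one way such a split could look. The `Step` type, its `kind` labels, and the choice to keep each memory action at the end of the segment that produced it are assumptions for illustration, not the paper's implementation.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Step:
    kind: str  # "think", "tool", or "memory" (hypothetical labels)
    text: str


def segment_at_memory_actions(trajectory: List[Step]) -> List[List[Step]]:
    """Split a trajectory so that each segment ends with a memory action.

    Within a segment the context only grows by appending, so the segment can
    be treated as one causally continuous sequence. After a memory action the
    context has been edited, so the next step starts a new segment.
    """
    segments: List[List[Step]] = []
    current: List[Step] = []
    for step in trajectory:
        current.append(step)
        if step.kind == "memory":
            segments.append(current)
            current = []
    if current:  # trailing steps after the last memory action
        segments.append(current)
    return segments


trajectory = [
    Step("think", "plan the search"),
    Step("tool", "search(query)"),
    Step("memory", "delete stale results"),
    Step("think", "refine the query"),
    Step("tool", "search(refined query)"),
    Step("memory", "insert summary"),
    Step("think", "final answer"),
]
segments = segment_at_memory_actions(trajectory)
```

Each segment could then be scored against the context that was actually in place when its tokens were generated.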
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript introduces the Memory-as-Action (MemAct) framework, which reformulates working-memory management for long-horizon agentic tasks as a reinforcement-learning policy over in-place editing actions (deletion and insertion). It further proposes Dynamic Context Policy Optimization to make end-to-end RL tractable on variable-length contexts. The central empirical claim is that the resulting MemAct-RL-14B model matches the accuracy of models 16× larger while reducing average context length by 51 %, with learned strategies that adapt to model capability and generalize across task complexities.
Significance. If the empirical results are shown to be robust, the work would constitute a meaningful contribution to efficient long-context agent design by demonstrating that joint RL optimization of retention and task reward can produce adaptive, model-aware context policies. The framing of memory management as explicit actions is conceptually clean and could be extended to other resource-constrained agent settings.
major comments (3)
- [Abstract and §4] Abstract and §4 (Experiments): the reported accuracy parity with 16× larger models and the 51 % context reduction are stated without any description of the reward function, the precise baselines, statistical significance testing, or ablations isolating Dynamic Context Policy Optimization; these omissions prevent evaluation of the central performance claim.
- [§3.2] §3.2 (Dynamic Context Policy Optimization): the method is described at a high level; a concrete account of how gradients are approximated across variable-length contexts and how the composite reward balances short-term accuracy against long-term information needs is required to assess whether the policy can avoid myopic deletions or credit-assignment collapse in long-horizon trajectories.
- [§4] §4 (Experiments): no analysis is provided of whether the learned editing policy introduces new reasoning failures (e.g., premature deletion of information required only at later steps) or whether performance gains hold under different task lengths and model scales beyond the single 14B checkpoint reported.
minor comments (2)
- [Notation] Notation: the distinction between the editing policy and the underlying LLM policy should be made explicit in every equation and algorithm box.
- [Figures] Figure captions: add error bars or confidence intervals to the context-length reduction plots so that the 51 % average can be interpreted with statistical context.
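The requests for significance testing and confidence intervals can be illustrated with a small paired comparison across random seeds. The per-seed numbers below are invented for the example and are not results from the paper; the sketch uses only the Python standard library and computes a paired t statistic plus a percentile bootstrap interval for the mean difference.

```python
import math
import random
import statistics

# Hypothetical per-seed average context lengths (tokens); not from the paper.
baseline = [8100, 7900, 8300, 8000, 8200]
memact = [4000, 3900, 4100, 3950, 4050]

# Paired differences: the same seed is compared under both methods.
diffs = [b - m for b, m in zip(baseline, memact)]
n = len(diffs)
mean_diff = statistics.mean(diffs)
std_err = statistics.stdev(diffs) / math.sqrt(n)
t_stat = mean_diff / std_err  # compare to a t distribution with n - 1 d.o.f.


def bootstrap_ci(values, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap confidence interval for the mean of `values`."""
    rng = random.Random(seed)
    means = sorted(
        statistics.mean(rng.choices(values, k=len(values)))
        for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


ci_low, ci_high = bootstrap_ci(diffs)
```

An interval like `(ci_low, ci_high)` is the kind of quantity that could be drawn as an error bar on a context-length plot.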
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the MemAct framework's conceptual contribution. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and analyses.
Point-by-point responses
- Referee: [Abstract and §4] Abstract and §4 (Experiments): the reported accuracy parity with 16× larger models and the 51 % context reduction are stated without any description of the reward function, the precise baselines, statistical significance testing, or ablations isolating Dynamic Context Policy Optimization; these omissions prevent evaluation of the central performance claim.
  Authors: We agree that additional detail is needed for proper evaluation. In the revised manuscript, we will expand the abstract to briefly describe the reward function (task accuracy reward combined with a context-length penalty) and key baselines. Section 4 will be updated with: (i) the exact reward formulation, (ii) a complete list of baselines including model sizes and prompting methods, (iii) statistical significance results using paired t-tests across multiple random seeds, and (iv) ablations that isolate Dynamic Context Policy Optimization by comparing against ablated variants. These changes will directly address the evaluation concerns.
  Revision: yes
- Referee: [§3.2] §3.2 (Dynamic Context Policy Optimization): the method is described at a high level; a concrete account of how gradients are approximated across variable-length contexts and how the composite reward balances short-term accuracy against long-term information needs is required to assess whether the policy can avoid myopic deletions or credit-assignment collapse in long-horizon trajectories.
  Authors: We acknowledge the high-level presentation in §3.2. The revision will add a concrete technical description: gradients are approximated via REINFORCE with a learned baseline and importance sampling to accommodate variable-length contexts after each edit action. The composite reward will be specified as a weighted sum of immediate task reward and a discounted long-term retention term that assigns value to information needed in future steps. We will also discuss how this formulation, together with eligibility traces, reduces the risk of myopic deletions and credit-assignment collapse.
  Revision: yes
- Referee: [§4] §4 (Experiments): no analysis is provided of whether the learned editing policy introduces new reasoning failures (e.g., premature deletion of information required only at later steps) or whether performance gains hold under different task lengths and model scales beyond the single 14B checkpoint reported.
  Authors: This is a fair observation. We will add a dedicated error-analysis subsection in §4 that examines trajectories for premature deletions, quantifies their frequency and impact on final accuracy, and reports results across a wider range of task lengths. We will also include performance numbers for a 7B-scale variant to illustrate behavior at different model scales. Resource constraints prevented exhaustive testing at additional scales in the original submission, but the added experiments will strengthen the robustness claims.
  Revision: partial
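The simulated responses above mention a task-accuracy reward combined with a context-length penalty, and REINFORCE with a baseline. The sketch below shows the generic shape of those pieces under stated assumptions: the function names, the linear penalty, `max_tokens`, `length_weight`, and `gamma` are all hypothetical, and importance sampling and eligibility traces are left out. It is not the paper's formulation.

```python
from typing import List


def trajectory_reward(correct: bool, avg_context_tokens: float,
                      max_tokens: float = 32_000.0,
                      length_weight: float = 0.1) -> float:
    """Task-accuracy reward minus a context-length penalty (hypothetical form)."""
    task_reward = 1.0 if correct else 0.0
    penalty = length_weight * min(avg_context_tokens / max_tokens, 1.0)
    return task_reward - penalty


def discounted_returns(rewards: List[float], gamma: float = 0.99) -> List[float]:
    """Return-to-go G_t = r_t + gamma * G_{t+1}, computed backwards."""
    returns = [0.0] * len(rewards)
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        returns[t] = running
    return returns


def reinforce_advantages(returns: List[float],
                         baselines: List[float]) -> List[float]:
    """REINFORCE with a baseline weights each log-prob gradient by G_t - b_t."""
    return [g - b for g, b in zip(returns, baselines)]


# Toy trajectory: reward arrives only at the final step.
returns = discounted_returns([0.0, 0.0, 1.0], gamma=0.5)
advantages = reinforce_advantages(returns, [0.25, 0.25, 0.25])
short_ctx = trajectory_reward(True, avg_context_tokens=8_000)
long_ctx = trajectory_reward(True, avg_context_tokens=16_000)
```

In this toy setup a correct answer reached with a shorter average context earns a slightly higher reward. The discounting also shows how credit from a late reward reaches an early edit only in attenuated form, which is the credit-assignment concern raised in the second major comment.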
Circularity Check
No significant circularity; empirical RL results are independent of inputs
Full rationale
The paper proposes the MemAct framework by formulating context curation as in-place editing actions optimized via end-to-end RL, then introduces Dynamic Context Policy Optimization to address training efficiency. The central claims (matching accuracy of 16x larger models while cutting context length 51%, with adaptive learned strategies) are presented as outcomes of experimental training and evaluation rather than any closed-form derivation or first-principles reduction. No equations or steps in the abstract or description equate a 'prediction' to a fitted parameter by construction, invoke self-citations as load-bearing uniqueness theorems, or smuggle ansatzes; the RL loop and policy are external mechanisms whose stability is an empirical question, not a definitional tautology. The work is therefore self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
invented entities (2)
- Memory-as-Action (MemAct) framework: no independent evidence
- Dynamic Context Policy Optimization: no independent evidence
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "memory actions can overwrite or remove earlier content from the working memory, thereby breaking the causal continuity of the trajectory... Dynamic Context Policy Optimization... segmenting trajectories at memory action points"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Forward citations
Cited by 9 Pith papers
- Agent-BRACE: Decoupling Beliefs from Actions in Long-Horizon Tasks via Verbalized State Uncertainty
  Agent-BRACE improves LLM agent performance on long-horizon partially observable tasks by 5.3-14.5% through a decoupled belief state of verbalized atomic claims with certainty labels that keeps context length constant.
- Remember the Decision, Not the Description: A Rate-Distortion Framework for Agent Memory
  Memory for long-horizon agents should preserve distinctions that affect decisions under a fixed budget, not descriptive features, yielding an exact forgetting boundary and a new online learner DeMem with regret guarantees.
- Learning How and What to Memorize: Cognition-Inspired Two-Stage Optimization for Evolving Memory
  MemCoE learns memory organization guidelines via contrastive feedback and then trains a guideline-aligned RL policy for memory updates, yielding consistent gains on personalization benchmarks.
- ClawVM: Harness-Managed Virtual Memory for Stateful Tool-Using LLM Agents
  ClawVM introduces a harness-managed virtual memory system for LLM agents that ensures deterministic residency and durability of state under token budgets by using typed pages and validated writeback.
- PERMA: Benchmarking Personalized Memory Agents via Event-Driven Preference and Realistic Task Environments
  PERMA is a new benchmark using temporally ordered events, text variability, and linguistic alignment to evaluate LLM memory agents on persona consistency beyond simple retrieval.
- Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
  Policy directives can be lost during context assembly in language model agents, leading to unprompted policy violations that SafeContext can partially prevent.
- MemFactory: Unified Inference & Training Framework for Agent Memory
  MemFactory is a new unified modular framework for memory-augmented LLM agent inference and training that integrates GRPO and reports up to 14.8% relative gains on MemAgent evaluations.
- The Landscape of Agentic Reinforcement Learning for LLMs: A Survey
  Survey that defines agentic RL for LLMs via POMDPs, introduces a taxonomy of planning/tool-use/memory/reasoning capabilities and domains, and compiles open environments from over 500 papers.
- Ghost in the Context: Measuring Policy-Carriage Failures in Decision-Time Assembly
  The paper measures policy-carriage failures during LLM context assembly and evaluates SafeContext as a partial mitigation on Llama, Qwen, and Mistral models.