Useful Memories Become Faulty When Continuously Updated by LLMs
Pith reviewed 2026-05-14 19:55 UTC · model grok-4.3
The pith
Consolidated memories from LLMs degrade over repeated updates and can perform worse than using no memory at all.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. Even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. The regression traces to the consolidation step rather than the underlying experience, while an episodic-only control that retains the raw trajectories remains competitive.
What carries the argument
The consolidation step, in which an LLM rewrites trajectories into a continuously updated textual memory bank that replaces raw episodes.
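A minimal sketch of that step, assuming a hypothetical `rewrite` callable in place of the LLM call (the paper's actual prompts and memory format are not reproduced here). The class names and the toy `lossy_rewrite` function are illustrative only. The sketch shows why the design is lossy by construction: once the bank replaces the episodes, any detail the rewrite drops cannot be recovered.

```python
from typing import Callable, List

# Hypothetical stand-in for the LLM call: takes the current memory bank and a
# new trajectory, and returns the full replacement bank.
Rewrite = Callable[[List[str], str], List[str]]


class ConsolidatedMemory:
    """Memory bank that is rewritten after every interaction (assumed design)."""

    def __init__(self, rewrite: Rewrite) -> None:
        self.rewrite = rewrite
        self.bank: List[str] = []

    def update(self, trajectory: str) -> None:
        # The raw trajectory is not stored; only the rewritten bank survives.
        self.bank = self.rewrite(self.bank, trajectory)


class EpisodicMemory:
    """Control condition: keep every raw trajectory verbatim."""

    def __init__(self) -> None:
        self.episodes: List[str] = []

    def update(self, trajectory: str) -> None:
        self.episodes.append(trajectory)


def lossy_rewrite(bank: List[str], trajectory: str) -> List[str]:
    # Toy rewrite that keeps only the first word of each item, to mimic
    # over-abstraction. A real system would call an LLM here.
    return [item.split()[0] for item in bank + [trajectory]]


consolidated = ConsolidatedMemory(lossy_rewrite)
episodic = EpisodicMemory()
for t in ["rotate grid 90 degrees", "fill enclosed cells blue"]:
    consolidated.update(t)
    episodic.update(t)

print(consolidated.bank)   # abstracted items only
print(episodic.episodes)   # full trajectories preserved
```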
If this is right
- Memory utility rises initially but then falls below the no-memory baseline as updates accumulate.
- Retaining raw episodes by default doubles accuracy compared with forced-consolidation agents in the ARC-AGI Stream environment.
- An episodic-only regime that never consolidates matches the performance of agents that decide when to consolidate.
- The same input trajectories produce qualitatively different memories under different update schedules.
Where Pith is reading between the lines
- Agent memory designs should expose explicit controls for when consolidation occurs instead of triggering it automatically after every interaction.
- The same degradation pattern may appear in other continuous memory-update systems that rely on generative rewriting rather than simple storage.
- Longer-term tests could measure whether episodic retention continues to outperform consolidation once the number of experiences grows by another order of magnitude.
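The first point above, explicit control over when consolidation occurs, could look roughly like the following sketch. Everything here is an assumption made for illustration: the `GatedMemory` class, the `summarize` callable standing in for an LLM, and the toy gate that fires on every third episode. The one design choice it tries to capture from the paper is that consolidated notes are stored alongside raw episodes and never replace them.

```python
from dataclasses import dataclass, field
from typing import Callable, List


@dataclass
class GatedMemory:
    """Keeps raw episodes; consolidates only when an explicit gate allows it."""

    summarize: Callable[[List[str]], str]            # hypothetical LLM summarizer
    should_consolidate: Callable[[List[str]], bool]  # hypothetical gate
    episodes: List[str] = field(default_factory=list)
    notes: List[str] = field(default_factory=list)

    def add(self, trajectory: str) -> None:
        self.episodes.append(trajectory)
        if self.should_consolidate(self.episodes):
            # Notes are stored next to the episodes, never in place of them.
            self.notes.append(self.summarize(self.episodes))


memory = GatedMemory(
    summarize=lambda eps: f"summary of {len(eps)} episodes",
    should_consolidate=lambda eps: len(eps) % 3 == 0,  # toy gate: every 3rd episode
)
for i in range(4):
    memory.add(f"episode-{i}")

print(memory.notes)          # one note, written when the gate fired
print(len(memory.episodes))  # all four raw episodes are still present
```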
Load-bearing premise
The observed performance drop is caused by the act of consolidation rather than by limits specific to the tested models, tasks, or update schedules.
What would settle it
Run the same ground-truth trajectories through consolidation under a different update schedule or a different model, then check whether the failure rate on previously solved problems stays near 54% or drops substantially.
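One way to make that comparison concrete is to compute the regression rate per condition. The sketch below assumes the rate is the fraction of problems solved without memory that fail once memory is attached; the paper's exact metric is not stated here, so that definition is a guess. The function name and the per-problem results are invented for illustration and are not the paper's data.

```python
from typing import Dict


def regression_rate(baseline: Dict[str, bool], with_memory: Dict[str, bool]) -> float:
    """Fraction of problems solved without memory that fail with memory."""
    solved = [pid for pid, ok in baseline.items() if ok]
    if not solved:
        raise ValueError("baseline solved no problems")
    # A problem missing from the with-memory run is counted as a failure.
    regressed = [pid for pid in solved if not with_memory.get(pid, False)]
    return len(regressed) / len(solved)


# Toy outcomes (True = solved); purely illustrative values.
baseline = {"p1": True, "p2": True, "p3": False, "p4": True, "p5": True}
schedule_a = {"p1": False, "p2": True, "p3": False, "p4": False, "p5": True}
schedule_b = {"p1": True, "p2": True, "p3": True, "p4": False, "p5": True}

rate_a = regression_rate(baseline, schedule_a)
rate_b = regression_rate(baseline, schedule_b)
print(rate_a, rate_b)
```

A large gap between schedules on identical trajectories would point at the consolidation procedure, while similar rates across models and schedules would support the paper's broader claim.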
Original abstract
Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.
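The abstract names three memory actions but does not define them, so the following is only a guess at what such an interface might look like. The semantics are assumptions: Retain appends the raw episode, Delete drops one entry by index, and Consolidate replaces the whole buffer with rewritten text. The `step` function and the `allowed` parameter are hypothetical; `allowed` models the episodic-only regime by disabling consolidation.

```python
from enum import Enum
from typing import List, Optional, Set


class Action(Enum):
    RETAIN = "retain"
    DELETE = "delete"
    CONSOLIDATE = "consolidate"


ALL_ACTIONS: Set[Action] = set(Action)
EPISODIC_ONLY: Set[Action] = {Action.RETAIN, Action.DELETE}  # consolidation disabled


def step(
    buffer: List[str],
    action: Action,
    *,
    episode: Optional[str] = None,
    index: Optional[int] = None,
    merged: Optional[str] = None,
    allowed: Set[Action] = ALL_ACTIONS,
) -> List[str]:
    """Apply one memory action and return the new buffer (assumed semantics)."""
    if action not in allowed:
        raise ValueError(f"{action.name} is disabled in this regime")
    if action is Action.RETAIN and episode is not None:
        return buffer + [episode]            # store the raw episode verbatim
    if action is Action.DELETE and index is not None:
        return buffer[:index] + buffer[index + 1:]
    if action is Action.CONSOLIDATE and merged is not None:
        return [merged]                      # rewritten text replaces every entry
    raise ValueError(f"missing argument for {action.name}")


buf: List[str] = []
buf = step(buf, Action.RETAIN, episode="task 1 trajectory")
buf = step(buf, Action.RETAIN, episode="task 2 trajectory")
buf = step(buf, Action.DELETE, index=0)
print(buf)
```

Passing `allowed=EPISODIC_ONLY` makes any Consolidate call raise, which is one simple way to run the episodic-management-only control.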
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that LLM-driven consolidation of memories in agentic systems turns useful episodic experiences into faulty abstractions: utility rises then falls below the no-memory baseline, and even ground-truth consolidations cause GPT-5.4 to fail on 54% of ARC-AGI problems it previously solved. In a custom ARC-AGI Stream environment exposing Retain/Delete/Consolidate actions, agents that preserve raw episodes by default outperform forced-consolidation regimes, and episodic-only management matches the best auto regime.
Significance. If the empirical controls hold, the result is significant for agent memory design: it supplies concrete evidence that continuous LLM rewriting of trajectories can overwrite evidence rather than distill it, and that raw episodic retention can be more robust than schema-like consolidation for current models. This directly informs practical choices in self-improving agents and highlights a concrete failure mode (54% regression from ground-truth) that future memory architectures must address.
major comments (2)
- [Methods] Methods section: the ARC-AGI Stream environment, the precise semantics of the Retain/Delete/Consolidate actions, the update schedules, and the exact procedure for generating ground-truth consolidations are not described at a level that permits reproduction of the 54% failure rate or the episodic-vs-consolidation comparison.
- [Results] Results (utility curves and 54% figure): the paper asserts that degradation is caused by the consolidation step itself rather than model or task idiosyncrasies, yet only a single model (GPT-5.4) is reported; an ablation across at least one additional model family is needed to support the generality of the claim that consolidation overwrites evidence.
minor comments (1)
- [Abstract] Abstract: the model identifier 'GPT-5.4' should be clarified (real release, internal version, or hypothetical) and the exact accuracy metric used for the 54% figure should be stated.
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. These points identify key areas where additional detail and generality would strengthen the work. We respond to each major comment below and will revise the manuscript accordingly.
Point-by-point responses
- Referee: [Methods] Methods section: the ARC-AGI Stream environment, the precise semantics of the Retain/Delete/Consolidate actions, the update schedules, and the exact procedure for generating ground-truth consolidations are not described at a level that permits reproduction of the 54% failure rate or the episodic-vs-consolidation comparison.
  Authors: We agree that the current Methods section lacks the level of detail required for full reproducibility. In the revised manuscript we will substantially expand this section to provide: (1) a complete specification of the ARC-AGI Stream environment, (2) the precise semantics and state-transition rules for the Retain, Delete, and Consolidate actions, (3) the exact update schedules and triggering conditions used in each experimental regime, and (4) the step-by-step procedure employed to generate the ground-truth consolidations. These additions will enable independent replication of the reported 54% failure rate and the episodic-versus-consolidation comparisons.
  Revision: yes
- Referee: [Results] Results (utility curves and 54% figure): the paper asserts that degradation is caused by the consolidation step itself rather than model or task idiosyncrasies, yet only a single model (GPT-5.4) is reported; an ablation across at least one additional model family is needed to support the generality of the claim that consolidation overwrites evidence.
  Authors: The referee is correct that our primary results are reported for a single model. While GPT-5.4 was selected because it is the strongest available model on the ARC-AGI tasks we study, we acknowledge that this limits claims about generality. In the revision we will add an ablation using at least one additional model family and will report the corresponding utility curves and failure rates. This will allow us to assess whether the observed degradation from consolidation is idiosyncratic to GPT-5.4 or holds more broadly, thereby strengthening the evidence that the consolidation process itself can overwrite useful episodic evidence.
  Revision: yes
Circularity Check
No significant circularity
full rationale
The paper is an empirical study measuring LLM agent performance on ARC-AGI tasks under different memory policies (episodic retention vs. continuous consolidation). No derivation chain, equations, or fitted parameters exist that reduce a claimed result to its own inputs by construction. Central claims rest on direct experimental comparisons (e.g., accuracy drops under forced consolidation, episodic baselines remaining competitive) against external task metrics and controlled schedules. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz; results are falsifiable via replication on the described environment.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LLM-based rewriting of trajectories produces consolidated memories that agents can use for future decisions
- domain assumption: ARC-AGI problems and the Stream environment are representative of the memory demands faced by real agents