
arxiv: 2605.12978 · v1 · pith:4GRNZWZPnew · submitted 2026-05-13 · 💻 cs.AI

Useful Memories Become Faulty When Continuously Updated by LLMs

Pith reviewed 2026-05-14 19:55 UTC · model grok-4.3

classification 💻 cs.AI
keywords LLM memory · agentic memory · consolidated memory · episodic memory · ARC-AGI · memory degradation · self-improving agents

The pith

Consolidated memories from LLMs degrade over repeated updates and can perform worse than using no memory at all.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that LLM-based consolidation of past experiences into reusable memories starts out helpful but turns faulty as updates continue, eventually dropping below the performance of agents with no memory system. The authors attribute this to the rewriting step itself rather than to the quality of the experiences: even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of the problems it had previously solved without memory. Retaining raw episodic traces without any consolidation remains competitive, leading the authors to recommend that agents treat original trajectories as primary evidence and gate consolidation explicitly rather than trigger it by default after every interaction.

Core claim

Consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. Even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. The regression traces to the consolidation step rather than the underlying experience, while an episodic-only control that retains the raw trajectories remains competitive.

What carries the argument

The consolidation step, in which an LLM rewrites trajectories into a continuously updated textual memory bank that replaces raw episodes.
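The difference between the two memory forms can be illustrated with a minimal sketch. Everything below is an assumption made for illustration: the class names, the `Rewriter` signature, and the toy `lossy_rewrite` function (a truncating merge standing in for an LLM call) are hypothetical and are not taken from the paper.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical stand-in for an LLM rewrite call: takes the current memory text
# and a new trajectory, and returns the replacement memory text.
Rewriter = Callable[[str, str], str]


@dataclass
class EpisodicStore:
    """Keeps every raw trajectory verbatim; nothing is ever rewritten."""
    episodes: List[str] = field(default_factory=list)

    def update(self, trajectory: str) -> None:
        self.episodes.append(trajectory)


@dataclass
class ConsolidatedBank:
    """Holds a single rewritten memory text; the raw trajectory is not stored."""
    rewrite: Rewriter
    memory: str = ""
    updates: int = 0

    def update(self, trajectory: str) -> None:
        # The old memory is replaced by whatever the rewriter returns.
        self.memory = self.rewrite(self.memory, trajectory)
        self.updates += 1


def lossy_rewrite(memory: str, trajectory: str, budget: int = 40) -> str:
    """Toy rewriter (an assumption, not the paper's prompt): merge, then truncate."""
    merged = (memory + " | " + trajectory).strip(" |")
    return merged[-budget:]  # keeps only the most recent `budget` characters


store = EpisodicStore()
bank = ConsolidatedBank(rewrite=lossy_rewrite)
trajectories = [f"task{i}: rotate grid then recolor" for i in range(5)]
for t in trajectories:
    store.update(t)
    bank.update(t)
```

After the loop, `store.episodes` still holds all five trajectories, while `bank.memory` holds only what survived repeated rewriting. A real LLM rewriter loses information in less predictable ways than truncation; the sketch only shows why evidence cannot be recovered once the memory text replaces the raw episodes.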

If this is right

  • Memory utility rises initially but then falls below the no-memory baseline as updates accumulate.
  • Retaining raw episodes by default doubles accuracy compared with forced-consolidation agents in the ARC-AGI Stream environment.
  • An episodic-only regime that never consolidates matches the performance of agents that decide when to consolidate.
  • The same input trajectories produce qualitatively different memories under different update schedules.
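The schedule dependence in the last point can be reproduced with a toy consolidator. This is a sketch under loud assumptions: `toy_consolidate` is a hypothetical token-frequency rule, not an LLM and not the paper's procedure, and the trajectories are made-up strings.

```python
from collections import Counter
from typing import Iterable, List


def toy_consolidate(memory: List[str], batch: Iterable[str], k: int = 2) -> List[str]:
    """Toy consolidator (assumption): keep the k most frequent tokens seen in this call.

    Tokens already in memory count once each, so earlier evidence about how
    often they occurred is lost on every call.
    """
    counts = Counter(memory)
    for trajectory in batch:
        counts.update(trajectory.split())
    return [token for token, _ in counts.most_common(k)]


def run_schedule(trajectories: List[str], step: int, k: int = 2) -> List[str]:
    """Consolidate `step` trajectories at a time, in order."""
    memory: List[str] = []
    for i in range(0, len(trajectories), step):
        memory = toy_consolidate(memory, trajectories[i:i + step], k)
    return memory


trajectories = ["rotate recolor", "rotate recolor", "crop", "crop", "crop"]
after_every_episode = run_schedule(trajectories, step=1)
single_batch = run_schedule(trajectories, step=len(trajectories))
```

With these inputs, consolidating after every episode ends with `["rotate", "recolor"]`, while a single batch ends with `["crop", "rotate"]`: the most frequent token never enters the incrementally updated memory because each update sees it only once.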

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Agent memory designs should expose explicit controls for when consolidation occurs instead of triggering it automatically after every interaction.
  • The same degradation pattern may appear in other continuous memory-update systems that rely on generative rewriting rather than simple storage.
  • Longer-term tests could measure whether episodic retention continues to outperform consolidation once the number of experiences grows by another order of magnitude.
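One way to picture explicit consolidation controls is a memory object whose default action leaves raw episodes untouched. The sketch below is hypothetical: the paper's Retain, Delete, and Consolidate semantics are not specified on this page, so the `GatedMemory` class, its method names, and the choice to keep source episodes after consolidation are all assumptions.

```python
from dataclasses import dataclass, field
from enum import Enum
from typing import Callable, Dict, List, Optional


class Action(Enum):
    RETAIN = "retain"            # keep raw episodes as they are (the default)
    DELETE = "delete"            # drop the listed episodes
    CONSOLIDATE = "consolidate"  # summarise the listed episodes, only on request


@dataclass
class GatedMemory:
    summarise: Callable[[List[str]], str]  # hypothetical stand-in for an LLM call
    episodes: Dict[int, str] = field(default_factory=dict)
    notes: List[str] = field(default_factory=list)
    _next_id: int = 0

    def add(self, trajectory: str) -> int:
        """Store a raw trajectory and return its id; no rewriting happens here."""
        eid = self._next_id
        self.episodes[eid] = trajectory
        self._next_id += 1
        return eid

    def apply(self, action: Action, ids: Optional[List[int]] = None) -> None:
        ids = ids or []
        if action is Action.RETAIN:
            return
        if action is Action.DELETE:
            for eid in ids:
                self.episodes.pop(eid, None)
        elif action is Action.CONSOLIDATE:
            sources = [self.episodes[e] for e in ids if e in self.episodes]
            if sources:
                self.notes.append(self.summarise(sources))
            # Assumption in this sketch: source episodes are kept, so a faulty
            # note can later be checked against the raw evidence.


mem = GatedMemory(summarise=lambda xs: f"{len(xs)} episodes merged")
a = mem.add("episode A")
b = mem.add("episode B")
mem.apply(Action.RETAIN)
mem.apply(Action.CONSOLIDATE, [a, b])
```

Here consolidation adds a note alongside the episodes instead of replacing them, which is one possible reading of "treat raw episodes as first-class evidence".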

Load-bearing premise

The observed performance drop is caused by the act of consolidation rather than by limits specific to the tested models, tasks, or update schedules.

What would settle it

Run the same ground-truth trajectories through consolidation under a different update schedule or a different model, then check whether the failure rate on previously solved problems stays near 54% or drops substantially.
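A failure rate of this kind can be computed as the fraction of problems solved without memory that are no longer solved with memory. The sketch below assumes that definition; the paper's exact metric is not stated on this page, the function name is hypothetical, and the example data are invented purely to exercise the code.

```python
from typing import Dict


def regression_rate(solved_without: Dict[str, bool], solved_with: Dict[str, bool]) -> float:
    """Share of problems solved without memory that fail once memory is used.

    Assumed definition: problems missing from `solved_with` count as failures.
    """
    baseline = [task for task, ok in solved_without.items() if ok]
    if not baseline:
        raise ValueError("no previously solved problems to compare against")
    regressed = [task for task in baseline if not solved_with.get(task, False)]
    return len(regressed) / len(baseline)


# Invented toy outcomes, not results from the paper.
without_memory = {"a": True, "b": True, "c": True, "d": True, "e": False}
with_memory = {"a": True, "b": False, "c": False, "d": True, "e": True}
rate = regression_rate(without_memory, with_memory)
```

Problems that were unsolved without memory (such as `e`) are excluded from the denominator, so gains from memory do not offset regressions in this measure.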

Original abstract

Learning from past experience benefits from two complementary forms of memory: episodic traces -- raw trajectories of what happened -- and consolidated abstractions distilled across many episodes into reusable, schema-like lessons. Recent agentic-memory systems pursue the consolidated form: an LLM rewrites past trajectories into a textual memory bank that it continuously updates with new interactions, promising self-improving agents without parameter updates. Yet we find that such consolidated memories produced by today's LLMs are often faulty even when derived from useful experiences. As consolidation proceeds, memory utility first rises, then degrades, and can fall below the no-memory baseline. More surprisingly, even when consolidating from ground-truth solutions, GPT-5.4 fails on 54% of a set of ARC-AGI problems it had previously solved without memory. We trace the regression to the consolidation step rather than the underlying experience: the same trajectories yield qualitatively different memories under different update schedules, and an episodic-only control that simply retains those trajectories remains competitive with the consolidators we test. In a controlled ARC-AGI Stream environment that exposes Retain, Delete, and Consolidate actions, agents preserve raw episodes by default and double the accuracy of their forced-consolidation counterparts; disabling consolidation entirely (episodic management only) matches this auto regime. Practically, robust agent memory should treat raw episodes as first-class evidence and gate consolidation explicitly rather than firing it after every interaction. Looking forward, reliable agentic memory will require LLMs that can consolidate without overwriting the evidence they depend on.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper claims that LLM-driven consolidation of memories in agentic systems turns useful episodic experiences into faulty abstractions: utility rises, then falls below the no-memory baseline, and even when consolidating from ground-truth solutions GPT-5.4 fails on 54% of the ARC-AGI problems it had previously solved without memory. In a custom ARC-AGI Stream environment exposing Retain/Delete/Consolidate actions, agents that preserve raw episodes by default double the accuracy of forced-consolidation regimes, and episodic-only management matches the auto regime in which agents choose when to consolidate.

Significance. If the empirical controls hold, the result is significant for agent memory design: it supplies concrete evidence that continuous LLM rewriting of trajectories can overwrite evidence rather than distill it, and that raw episodic retention can be more robust than schema-like consolidation for current models. This directly informs practical choices in self-improving agents and highlights a concrete failure mode (54% regression from ground-truth) that future memory architectures must address.

major comments (2)
  1. [Methods] Methods section: the ARC-AGI Stream environment, the precise semantics of the Retain/Delete/Consolidate actions, the update schedules, and the exact procedure for generating ground-truth consolidations are not described at a level that permits reproduction of the 54% failure rate or the episodic-vs-consolidation comparison.
  2. [Results] Results (utility curves and 54% figure): the paper asserts that degradation is caused by the consolidation step itself rather than model or task idiosyncrasies, yet only a single model (GPT-5.4) is reported; an ablation across at least one additional model family is needed to support the generality of the claim that consolidation overwrites evidence.
minor comments (1)
  1. [Abstract] Abstract: the model identifier 'GPT-5.4' should be clarified (real release, internal version, or hypothetical) and the exact accuracy metric used for the 54% figure should be stated.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. These points identify key areas where additional detail and generality would strengthen the work. We respond to each major comment below and will revise the manuscript accordingly.

Point-by-point responses
  1. Referee: [Methods] Methods section: the ARC-AGI Stream environment, the precise semantics of the Retain/Delete/Consolidate actions, the update schedules, and the exact procedure for generating ground-truth consolidations are not described at a level that permits reproduction of the 54% failure rate or the episodic-vs-consolidation comparison.

    Authors: We agree that the current Methods section lacks the level of detail required for full reproducibility. In the revised manuscript we will substantially expand this section to provide: (1) a complete specification of the ARC-AGI Stream environment, (2) the precise semantics and state-transition rules for the Retain, Delete, and Consolidate actions, (3) the exact update schedules and triggering conditions used in each experimental regime, and (4) the step-by-step procedure employed to generate the ground-truth consolidations. These additions will enable independent replication of the reported 54% failure rate and the episodic-versus-consolidation comparisons.

    revision: yes

  2. Referee: [Results] Results (utility curves and 54% figure): the paper asserts that degradation is caused by the consolidation step itself rather than model or task idiosyncrasies, yet only a single model (GPT-5.4) is reported; an ablation across at least one additional model family is needed to support the generality of the claim that consolidation overwrites evidence.

    Authors: The referee is correct that our primary results are reported for a single model. While GPT-5.4 was selected because it is the strongest available model on the ARC-AGI tasks we study, we acknowledge that this limits claims about generality. In the revision we will add an ablation using at least one additional model family and will report the corresponding utility curves and failure rates. This will allow us to assess whether the observed degradation from consolidation is idiosyncratic to GPT-5.4 or holds more broadly, thereby strengthening the evidence that the consolidation process itself can overwrite useful episodic evidence.

    revision: yes

Circularity Check

0 steps flagged

No significant circularity

Full rationale

The paper is an empirical study measuring LLM agent performance on ARC-AGI tasks under different memory policies (episodic retention vs. continuous consolidation). No derivation chain, equations, or fitted parameters exist that reduce a claimed result to its own inputs by construction. Central claims rest on direct experimental comparisons (e.g., accuracy drops under forced consolidation, episodic baselines remaining competitive) against external task metrics and controlled schedules. Self-citations, if present, are not load-bearing for any uniqueness theorem or ansatz; results are falsifiable via replication on the described environment.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

The central claim rests on the domain assumption that LLM rewriting is the primary mechanism for memory consolidation in current agent systems and that ARC-AGI performance is a valid proxy for general agent utility. No free parameters or invented entities are introduced.

axioms (2)
  • Domain assumption: LLM-based rewriting of trajectories produces consolidated memories that agents can use for future decisions.
    Invoked throughout the abstract as the mechanism being tested.
  • Domain assumption: ARC-AGI problems and the Stream environment are representative of the memory demands faced by real agents.
    Used to generalize from the reported experiments to broader agent design.

pith-pipeline@v0.9.0 · 5586 in / 1563 out tokens · 59761 ms · 2026-05-14T19:55:18.938103+00:00 · methodology

