pith. sign in

arxiv: 2606.25449 · v1 · pith:6G2QJB3Unew · submitted 2026-06-24 · 💻 cs.CL · cs.AI· cs.LG

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

Pith reviewed 2026-06-25 21:04 UTC · model grok-4.3

classification 💻 cs.CL cs.AIcs.LG
keywords brittle memoryreclaim evaluationsource-first compressionmemory in language modelslossy memorycorrectabilitydialogue systems
0
0 comments X

The pith

A language model's lossy memory produces confident wrong answers where an empty memory would abstain.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that when a model retains a drifted conclusion but drops the source data that produced it, the model treats the stale value as correct and refuses correction, whereas an empty memory leads it to abstain. This pattern holds in the same direction across seven models with no reversals. Reclaim evaluation tests the effect by compressing an interaction at fixed budget then measuring whether a correction restores the known answer, scored exactly against ground truth. A source-first compression policy that keeps the original data and drops only the re-derivable conclusion restores correctability at the same budget. The problem scales in memory loops because one dropped-source error becomes uncorrectable and contaminates later steps.

Core claim

Brittle memory is the consistent behavioral pattern in which keeping a wrong conclusion after dropping its source is worse than keeping nothing at all; across seven models the direction never reverses. Reclaim evaluation measures this by applying fixed-budget compression to a drifted interaction, then testing whether a correction recovers the known answer, with exact scoring against ground truth and no judge. Correctability depends on whether the answer-determining source survives compression rather than on model capability. A one-line source-first policy restores correctability where the source is compact and identifiable, and this replicates across three deployed memory systems and on real

What carries the argument

Reclaim evaluation: a fixed-budget compression followed by a correction test that scores recovery of a known answer against ground truth without a judge.

If this is right

  • Chained memory loops turn one dropped-source error into an uncorrectable corruption that grows across downstream steps.
  • Source-first compression holds error span to a bounded budget horizon while length-matched random compression does not.
  • Past the point where the source no longer fits the budget, the source-first fix fails unless the note explicitly records that it is incomplete.
  • The effect appears in real dialogue data and in three different deployed memory systems.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The same source-survival requirement would apply to any retrieval-augmented system that must later revise earlier steps.
  • A completeness flag in memory notes would let downstream steps detect and handle truncated sources before they propagate.
  • Exact-match scoring against a pre-known answer could be adapted to other tasks where the correct final state is defined in advance.

Load-bearing premise

The test treats the known answer as unambiguous ground truth and assumes the compression-plus-correction setup isolates source loss without other confounding effects from model capability or prompt wording.

What would settle it

A single model and task combination in which a lossy memory that retains the wrong conclusion but drops the source produces a higher rate of correctable answers than an empty memory under the same compression budget.

Figures

Figures reproduced from arXiv: 2606.25449 by Alex Kwon.

Figure 1
Figure 1. Figure 1: Compression decides whether an error stays fixable. A model drifts in session 1; only a compressed memory crosses into session 2, at a fixed budget. Under lossy compression the memory keeps the salient wrong conclusion and discards the source, so a later correction has nothing to recompute from, and the model does not abstain, it confidently returns the stale wrong value. Under source-first compression the… view at source ↗
Figure 2
Figure 2. Figure 2: The boundary of the source-first law. Directed Reclaim Rate vs. ledger size N at two fixed memory budgets B, n=24/point, 95% bootstrap CI. source-first (solid, llama-3.1-8b) holds while the N-item source fits B, then drops to the budget-matched lossy-padded floor (dashed) the instant any item must be dropped. The cliff moves right with the budget (N=5→14 as B doubles), so the lever is whether the answer-de… view at source ↗
Figure 3
Figure 3. Figure 3: Noise crowds the source out of a fixed budget. Directed Reclaim Rate vs. decoy count added to a four-item source at a fixed budget, n=24/point, 95% CI. Naive (positional) source-first (red) decays to the lossy floor as decoys eat the budget; relevance-aware denoised source-first (green) holds flat. The frontier confirm (dotted) coincides with the 8B model: the noise cliff is capability-invariant, because a… view at source ↗
read the original abstract

A language model's memory can be worse than having no memory at all. Give a model a memory that kept a wrong conclusion but dropped the work behind it, and it emits that stale value as a confident answer; give the same model an empty memory and it abstains. Across seven models this direction never reverses, a clean kill condition that none breaks. We call this brittle memory: behavioral, not the near-immediate information bound beneath it; only its magnitude is disposition- and task-dependent, not its direction. We measure it with reclaim evaluation: compress a drifted interaction at a fixed budget, then test whether a correction recovers the known answer, scored against ground truth with no judge. Correctability is bottlenecked by whether the answer-determining source survives, not by capability. A one-line source-first policy (keep the recomputable source, drop the re-derivable conclusion) restores correctability at equal budget where that source is compact and identifiable; a length-matched control rules out added text as the cause. The hand-built oracle reaches 1.00; a one-prompt deployable version reclaims 0.49-0.88. The stake compounds: chained through a memory loop, a single dropped-source error corrupts a growing span of downstream steps and stays uncorrectable, while source-first holds to a bounded budget horizon. The wall and fix replicate across three deployed memory systems and on real dialogue (MultiWOZ), and past the budget where the source no longer fits, the fix fails silently unless the note records completeness. This is a controlled study of a mechanism, not a benchmark: judge-free exact scoring, matched-budget controls, and validators built to come out false. We release the harness, conditions, and validators.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that lossy memory in LLMs—retaining a drifted conclusion while dropping its source—produces confident incorrect answers, whereas empty memory leads to abstention. This 'brittle memory' effect holds in the same direction across seven models. Reclaim evaluation measures it via fixed-budget compression of drifted interactions followed by a correction test, scored by exact match to known ground truth with no judge. A source-first policy (keep recomputable source, drop re-derivable conclusion) restores correctability at matched budget; an oracle reaches 1.00 and a one-prompt version reaches 0.49-0.88. The effect and fix replicate on three deployed memory systems and MultiWOZ; a length-matched control rules out added text. The harness, conditions, and validators are released.

Significance. If the central claim holds, the work identifies a concrete mechanism by which partial memory can actively degrade performance below the no-memory baseline in memory-augmented systems, with a simple, deployable mitigation. The release of the evaluation harness, conditions, and validators is a clear strength, enabling direct reproduction and extension. The judge-free exact-match scoring and matched-budget controls further support falsifiability of the proposed mechanism.

major comments (2)
  1. [Abstract] Abstract: the headline claim that 'this direction never reverses' across seven models is load-bearing for the brittle-memory thesis, yet the design does not report a control that holds source presence fixed while varying only the correction prompt or model-family prompt handling; without this, uniformity could reflect harness artifacts rather than a general behavioral property.
  2. [Abstract] Abstract (reclaim evaluation description): the fixed-budget compression plus correction test is asserted to isolate whether the answer-determining source survives, but no quantitative check is described that the correction prompt's effectiveness is independent of model capability or disposition when source presence is controlled; this assumption underpins the 'correctability is bottlenecked by source survival' conclusion.
minor comments (2)
  1. [Abstract] Abstract: the seven models and three deployed memory systems are not named; listing them would improve reproducibility.
  2. [Abstract] Abstract: 'brittle memory' is defined behaviorally but could usefully contrast it with existing terms such as 'hallucination' or 'catastrophic forgetting' to clarify novelty.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on the abstract claims. We address each point below and will revise the manuscript to incorporate additional controls and clarifications as indicated.

read point-by-point responses
  1. Referee: [Abstract] Abstract: the headline claim that 'this direction never reverses' across seven models is load-bearing for the brittle-memory thesis, yet the design does not report a control that holds source presence fixed while varying only the correction prompt or model-family prompt handling; without this, uniformity could reflect harness artifacts rather than a general behavioral property.

    Authors: We agree that an explicit control holding source presence fixed while varying only the correction prompt (or model-family prompt handling) would more rigorously exclude the possibility of harness artifacts. The current experiments vary models across families but apply a fixed correction prompt. While the consistency of direction across architecturally diverse models provides supporting evidence that the effect is not prompt- or harness-specific, we acknowledge the gap. In revision we will add a dedicated control experiment (and corresponding discussion) that fixes source presence and systematically varies prompt elements and model families. revision: yes

  2. Referee: [Abstract] Abstract (reclaim evaluation description): the fixed-budget compression plus correction test is asserted to isolate whether the answer-determining source survives, but no quantitative check is described that the correction prompt's effectiveness is independent of model capability or disposition when source presence is controlled; this assumption underpins the 'correctability is bottlenecked by source survival' conclusion.

    Authors: We agree that a direct quantitative check of correction-prompt effectiveness (with source presence controlled) would strengthen the isolation of source survival as the bottleneck. The manuscript reports that an oracle providing the source reaches 1.00 and that the source-first policy improves correctability at matched budget, but does not include the requested cross-model measurement with source fixed. In the revision we will add this quantitative check, reporting correction success rates across the seven models when source presence is assured. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical measurements with released controls

full rationale

The paper reports direct experimental measurements of model behavior under fixed-budget compression and correction prompts, using exact-match scoring against a known ground-truth answer, length-matched controls, and a released harness. No equations, fitted parameters, or derivations are presented that reduce by construction to their own inputs. Claims such as 'direction never reverses across seven models' are stated as observed outcomes of the evaluation protocol rather than predictions derived from prior fits or self-citations. The design is self-contained against external benchmarks and does not rely on load-bearing self-citation chains or ansatzes smuggled via prior work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only; no explicit free parameters, axioms, or invented entities are described. The evaluation relies on an implicit assumption that ground-truth answers exist and are stable, but none are formalized.

pith-pipeline@v0.9.1-grok · 5845 in / 1059 out tokens · 23437 ms · 2026-06-25T21:04:59.686849+00:00 · methodology

discussion (0)

Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.

Reference graph

Works this paper leans on

39 extracted references · 3 linked inside Pith

  1. [1]

    Great, Now Write an Article About That: The Crescendo Multi-Turn

    Russinovich, Mark and Salem, Ahmed and Eldan, Ronen , booktitle=. Great, Now Write an Article About That: The Crescendo Multi-Turn

  2. [3]

    International Conference on Learning Representations (ICLR) , year=

    Towards Understanding Sycophancy in Language Models , author=. International Conference on Learning Representations (ICLR) , year=

  3. [4]

    International Conference on Learning Representations (ICLR) , year=

    Large Language Models Cannot Self-Correct Reasoning Yet , author=. International Conference on Learning Representations (ICLR) , year=

  4. [5]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Self-Refine: Iterative Refinement with Self-Feedback , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  5. [6]

    Advances in Neural Information Processing Systems (NeurIPS) , year=

    Reflexion: Language Agents with Verbal Reinforcement Learning , author=. Advances in Neural Information Processing Systems (NeurIPS) , year=

  6. [7]

    and Stoica, Ion and Gonzalez, Joseph E

    Packer, Charles and Wooders, Sarah and Lin, Kevin and Fang, Vivian and Patil, Shishir G. and Stoica, Ion and Gonzalez, Joseph E. , journal=

  7. [8]

    Retrieval-Augmented Generation for Knowledge-Intensive

    Lewis, Patrick and Perez, Ethan and Piktus, Aleksandra and others , booktitle=. Retrieval-Augmented Generation for Knowledge-Intensive

  8. [9]

    Locating and Editing Factual Associations in

    Meng, Kevin and Bau, David and Andonian, Alex and Belinkov, Yonatan , booktitle=. Locating and Editing Factual Associations in

  9. [10]

    Transactions of the Association for Computational Linguistics , volume=

    Lost in the Middle: How Language Models Use Long Contexts , author=. Transactions of the Association for Computational Linguistics , volume=

  10. [11]

    , booktitle=

    Pennington, Jeffrey and Socher, Richard and Manning, Christopher D. , booktitle=

  11. [12]

    Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

    Budzianowski, Pawe. Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP) , pages=

  12. [13]

    2024 , howpublished =

    Introducing. 2024 , howpublished =

  13. [14]

    2026 , howpublished =

  14. [15]

    2024 , howpublished =

    mem0: The Memory Layer for. 2024 , howpublished =

  15. [16]

    2022 , howpublished =

  16. [17]

    Xiao, Shitao and Liu, Zheng and Zhang, Peitian and Muennighoff, Niklas , journal =

  17. [18]

    Zeng, Aohan and Liu, Mingdao and Lu, Rui and Wang, Bowen and Liu, Xiao and Dong, Yuxiao and Tang, Jie , journal =

  18. [19]

    glaive-function-calling-v2 , year =

  19. [20]

    Claude Opus 4.8

    Anthropic . Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8, 2026 a . Model announcement

  20. [21]

    Claude Sonnet 4.6

    Anthropic . Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026 b . Model announcement

  21. [22]

    MultiWOZ -- a large-scale multi-domain W izard-of- O z dataset for task-oriented dialogue modelling

    Pawe Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, I \ n igo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Ga s i \'c . MultiWOZ -- a large-scale multi-domain W izard-of- O z dataset for task-oriented dialogue modelling. In Proceedings of the 2018 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp.\ 5016--5026, 2018

  22. [23]

    When context flips, safety breaks: Diagnosing brittle safety in aligned language models

    Dasol Choi and Alex Kwon. When context flips, safety breaks: Diagnosing brittle safety in aligned language models. arXiv preprint arXiv:2605.27851, 2026

  23. [24]

    glaive-function-calling-v2

    Glaive AI . glaive-function-calling-v2. https://huggingface.co/datasets/glaiveai/glaive-function-calling-v2, 2023. Dataset

  24. [25]

    Large language models cannot self-correct reasoning yet

    Jie Huang, Xinyun Chen, Swaroop Mishra, Huaixiu Steven Zheng, Adams Wei Yu, Xinying Song, and Denny Zhou. Large language models cannot self-correct reasoning yet. In International Conference on Learning Representations (ICLR), 2024

  25. [26]

    No robots

    Hugging Face H4 . No robots. https://huggingface.co/datasets/HuggingFaceH4/no_robots, 2023. Dataset

  26. [27]

    LangChain

    LangChain . LangChain . https://github.com/langchain-ai/langchain, 2022. Software

  27. [28]

    Retrieval-augmented generation for knowledge-intensive NLP tasks

    Patrick Lewis, Ethan Perez, Aleksandra Piktus, et al. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems (NeurIPS), 2020

  28. [29]

    Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang

    Nelson F. Liu, Kevin Lin, John Hewitt, Ashwin Paranjape, Michele Bevilacqua, Fabio Petroni, and Percy Liang. Lost in the middle: How language models use long contexts. Transactions of the Association for Computational Linguistics, 12: 0 157--173, 2024

  29. [30]

    Self-refine: Iterative refinement with self-feedback

    Aman Madaan, Niket Tandon, Prakhar Gupta, et al. Self-refine: Iterative refinement with self-feedback. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  30. [31]

    mem0: The memory layer for AI agents

    Mem0 AI . mem0: The memory layer for AI agents. https://github.com/mem0ai/mem0, 2024. Software

  31. [32]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT . In Advances in Neural Information Processing Systems (NeurIPS), 2022

  32. [33]

    Introducing Llama 3.1

    Meta AI . Introducing Llama 3.1. https://ai.meta.com/blog/meta-llama-3-1/, 2024. Model announcement

  33. [34]

    Patil, Ion Stoica, and Joseph E

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT : Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560, 2023

  34. [35]

    Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack

    Mark Russinovich, Ahmed Salem, and Ronen Eldan. Great, now write an article about that: The crescendo multi-turn LLM jailbreak attack. In 34th USENIX Security Symposium, 2025

  35. [36]

    Towards understanding sycophancy in language models

    Mrinank Sharma, Meg Tong, Tomasz Korbak, David Duvenaud, et al. Towards understanding sycophancy in language models. In International Conference on Learning Representations (ICLR), 2024

  36. [37]

    Reflexion: Language agents with verbal reinforcement learning

    Noah Shinn, Federico Cassano, Edward Berman, Ashwin Gopinath, Karthik Narasimhan, and Shunyu Yao. Reflexion: Language agents with verbal reinforcement learning. In Advances in Neural Information Processing Systems (NeurIPS), 2023

  37. [38]

    Grok 4.3

    xAI . Grok 4.3. https://docs.x.ai/developers/models/grok-4.3, 2026. Model card

  38. [39]

    C-Pack : Packed resources for general C hinese embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack : Packed resources for general C hinese embeddings. arXiv preprint arXiv:2309.07597, 2023

  39. [40]

    AgentTuning : Enabling generalized agent abilities for LLMs

    Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. AgentTuning : Enabling generalized agent abilities for LLMs . arXiv preprint arXiv:2310.12823, 2023