
arxiv: 2606.25449 · v2 · pith:6G2QJB3Unew · submitted 2026-06-24 · 💻 cs.CL · cs.AI · cs.LG

Reclaim Evaluation: A Lossy Memory Is Worse Than an Empty One

Pith reviewed 2026-06-30 10:18 UTC · model grok-4.3

classification 💻 cs.CL · cs.AI · cs.LG
keywords: brittle memory, reclaim evaluation, language model memory, correctability, source-first policy, lossy compression, abstention

The pith

A lossy memory in language models leads to confident wrong answers where an empty memory would cause abstention.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that a memory retaining incorrect conclusions without their supporting reasoning causes models to output stale errors confidently rather than abstain. Reclaim evaluation measures this by compressing a drifted interaction at a fixed budget, then testing whether a correction recovers the known answer, scored against ground truth. Across eight models, lossy memory is never better than empty memory and is strictly worse for models disposed to answer. A source-first policy that keeps the recomputable source and drops the re-derivable conclusion restores correctability at equal budget when the source is compact and identifiable. The failure compounds through memory loops and replicates on deployed memory systems and on real dialogue such as MultiWOZ.

Core claim

A language model's memory can be worse than no memory at all. A memory that keeps a wrong conclusion but drops the work behind it makes the model emit the stale value as a confident answer, where an empty memory would make it abstain; we call this brittle memory. We measure it with reclaim evaluation: compress a drifted interaction at a fixed budget, then test whether a correction recovers the known answer, scored against ground truth with no judge. Correctability is bottlenecked not by capability but by whether the answer-determining source survives compression, so an 8B model and a frontier one wall in the same place. A one-line source-first policy, keep the recomputable source and drop the re-derivable conclusion, restores correctability at equal budget where the answer-determining source is compact and identifiable.

What carries the argument

Reclaim evaluation: it compresses a drifted interaction at a fixed budget, then checks whether a correction recovers the known answer, which isolates brittle memory from other failure modes.
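
The protocol can be illustrated with a toy harness. This is a minimal sketch, not the paper's released harness: the `Memory` class, the `toy_reader` stand-in for a model disposed to answer, and the ledger values are all hypothetical. It shows the three paired memory conditions and judge-free exact scoring against a known answer.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Memory:
    source: Optional[List[int]]   # ledger items the answer is computed from
    conclusion: Optional[int]     # cached total, possibly stale


def toy_reader(memory, correction):
    """Hypothetical stand-in for a model that answers whenever it can."""
    index, new_value = correction
    if memory.source is not None:
        items = list(memory.source)
        items[index] = new_value          # apply the correction, then recompute
        return sum(items)
    if memory.conclusion is not None:
        return memory.conclusion          # nothing to recompute from: stale value
    return None                           # empty memory: abstain


def reclaim_outcome(memory, correction, truth):
    """Exact scoring against ground truth, no judge model involved."""
    answer = toy_reader(memory, correction)
    if answer is None:
        return "abstain"
    return "correct" if answer == truth else "confident_wrong"


# Session 1 drifted: item 1 was recorded as 54 instead of 45.
drifted_items = [120, 54, 30]
correction = (1, 45)
truth = 120 + 45 + 30

conditions = {
    "empty": Memory(None, None),
    "lossy": Memory(None, sum(drifted_items)),
    "source_first": Memory(drifted_items, None),
}
outcomes = {name: reclaim_outcome(m, correction, truth)
            for name, m in conditions.items()}
print(outcomes)
```

In this toy setting the lossy condition is the only one that returns a wrong value with no signal that anything is missing, which is the pattern the paper calls brittle memory.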

If this is right

  • Lossy memory is never better than empty memory and strictly worse on models disposed to answer rather than abstain.
  • Source-first policy reclaims 0.49-0.88 correctability, rising toward the oracle's 1.00 when a frontier model writes the note.
  • The failure compounds through a memory loop.
  • The pattern replicates on three deployed memory systems and on real dialogue such as MultiWOZ.
  • A length-matched control rules out added text and a deployable one-prompt form works.
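
The contrast between the two compression policies at a matched budget can be sketched as follows. Everything here is an illustrative assumption: the budget is counted in note entries rather than tokens, the `compress` and `source_survives` helpers are hypothetical, and the padded lossy note is only one reading of what a length-matched control might look like.

```python
def compress(items, conclusion, budget, policy):
    """Return a note of at most `budget` entries (entry-count budget is an assumption)."""
    if policy == "lossy_padded":
        # Keep the salient conclusion and pad to the same length with filler,
        # so any difference cannot be attributed to the amount of text.
        note = [("conclusion", conclusion)] + [("filler", None)] * max(budget - 1, 0)
        return note[:budget]
    if policy == "source_first":
        # Keep the recomputable source, drop the re-derivable conclusion.
        return [("item", value) for value in items[:budget]]
    raise ValueError(f"unknown policy: {policy}")


def source_survives(note, n_items):
    """True only if every answer-determining item made it into the note."""
    return sum(1 for kind, _ in note if kind == "item") == n_items


budget = 5
for n in (3, 5, 6):
    items = list(range(1, n + 1))
    for policy in ("lossy_padded", "source_first"):
        note = compress(items, sum(items), budget, policy)
        print(n, policy, len(note), source_survives(note, n))
```

Under these assumptions the source-first note keeps the full source while the item count fits the budget and loses it as soon as one item must be dropped, while the padded lossy note never carries the source at any size.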

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Conversational systems may need explicit source tracking to prevent error compounding over multiple turns.
  • Similar source-preservation rules could apply to other stateful AI components that compress history.
  • Testing the boundary where notes must record their own completeness could identify safe deployment limits.
  • Judge-free exact scoring on paired conditions offers a template for evaluating other memory mechanisms.

Load-bearing premise

The assumption that the answer-determining source is compact and identifiable so that a source-first policy can restore correctability at equal budget.

What would settle it

A case in which lossy memory that retains wrong conclusions recovers the known answer after correction at a higher rate than empty memory on a model that tends to answer rather than abstain.

Figures

Figures reproduced from arXiv: 2606.25449 by Alex Kwon.

Figure 1: Compression decides whether an error stays fixable. A model drifts in session 1; only a compressed memory crosses into session 2, at a fixed budget. Under lossy compression the memory keeps the salient wrong conclusion and discards the source, so a later correction has nothing to recompute from, and the model does not abstain, it confidently returns the stale wrong value. Under source-first compression the…

Figure 2: The boundary of the source-first law. Directed Reclaim Rate vs. ledger size N at two fixed memory budgets B, n=24/point, 95% bootstrap CI. source-first (solid, llama-3.1-8b) holds while the N-item source fits B, then drops to the budget-matched lossy-padded floor (dashed) the instant any item must be dropped. The cliff moves right with the budget (N=5→14 as B doubles), so the lever is whether the answer-de…

Figure 3: Noise crowds the source out of a fixed budget. Directed Reclaim Rate vs. decoy count added to a four-item source at a fixed budget, n=24/point, 95% CI. Naive (positional) source-first (red) decays to the lossy floor as decoys eat the budget; relevance-aware denoised source-first (green) holds flat. The frontier confirm (dotted) coincides with the 8B model: the noise cliff is capability-invariant, because a…
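
The captions report 95% bootstrap confidence intervals at n=24 per point. A percentile bootstrap over per-trial outcomes is one common way to get such an interval. The sketch below assumes that variant, a resample count of 2000, and made-up outcomes; none of these details are confirmed by the page.

```python
import random


def bootstrap_ci(outcomes, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of 0/1 outcomes (assumed variant)."""
    rng = random.Random(seed)
    n = len(outcomes)
    # Resample trials with replacement and record the rate of each resample.
    means = sorted(sum(rng.choices(outcomes, k=n)) / n for _ in range(n_resamples))
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi


# Hypothetical data: 18 reclaimed trials out of 24 (not taken from the paper).
trials = [1] * 18 + [0] * 6
rate = sum(trials) / len(trials)
low, high = bootstrap_ci(trials)
print(rate, low, high)
```

With only 24 trials per point the interval is wide, which is worth keeping in mind when reading where exactly a curve meets the lossy floor.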
Original abstract

A language model's memory can be worse than no memory at all. A memory that keeps a wrong conclusion but drops the work behind it makes the model emit the stale value as a confident answer, where an empty memory would make it abstain; we call this brittle memory. We measure it with reclaim evaluation: compress a drifted interaction at a fixed budget, then test whether a correction recovers the known answer, scored against ground truth with no judge. Correctability is bottlenecked not by capability but by whether the answer-determining source survives compression, so an 8B model and a frontier one wall in the same place. Across eight models a lossy memory is never better than an empty one, and strictly worse on those disposed to answer rather than abstain. A one-line source-first policy, keep the recomputable source and drop the re-derivable conclusion, restores correctability at equal budget where the answer-determining source is compact and identifiable; a length-matched control rules out added text, and a deployable one-prompt form reclaims 0.49-0.88, rising toward the oracle's 1.00 when a frontier model writes the note. The failure compounds through a memory loop and replicates on three deployed memory systems and on real dialogue (MultiWOZ), with a located boundary past which the fix fails silently unless the note records its completeness. This is a controlled study of a mechanism: judge-free exact scoring, matched-budget controls, and validators built to come out false; we release the harness, the paired memory conditions, and these validators.
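
The abstract's last boundary condition, that the fix fails silently unless the note records its completeness, can be made concrete with a small sketch. The `Note` class, the entry-count budget, and the reader's abstention rule are hypothetical illustrations, not the paper's operational definition.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Note:
    items: List[int]
    complete: bool    # did the whole source fit in the budget?


def write_note(items, budget):
    """Source-first note that also records whether anything was dropped."""
    kept = items[:budget]
    return Note(items=kept, complete=(len(kept) == len(items)))


def read_total(note, check_completeness) -> Optional[int]:
    """Recompute from the note; abstain on a truncated source if asked to check."""
    if check_completeness and not note.complete:
        return None
    return sum(note.items)


ledger = [10, 20, 30]          # true total is 60
note = write_note(ledger, budget=2)
print(read_total(note, check_completeness=False))
print(read_total(note, check_completeness=True))
```

Without the flag, a reader recomputes from a partial source and returns a plausible but wrong total; with it, the reader can fall back to abstaining once the source no longer fits.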

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 1 minor

Summary. The paper claims that lossy memory in language models can be strictly worse than no memory because it retains incorrect conclusions while dropping supporting work, causing confident but wrong answers where an empty memory would lead to abstention. It introduces 'reclaim evaluation,' which compresses drifted interactions at fixed budget and measures whether a correction recovers the known ground-truth answer (judge-free exact scoring). Across eight models, lossy memory is never better than empty and is strictly worse for models disposed to answer rather than abstain. A one-line source-first policy (keep recomputable source, drop re-derivable conclusion) restores correctability (0.49-0.88, approaching oracle 1.00) at equal budget where the answer-determining source is compact and identifiable; this is validated with length-matched controls, a one-prompt deployable form, replication on three deployed memory systems and MultiWOZ dialogue, and a noted boundary where the fix fails unless completeness is recorded. The study emphasizes controlled design, released harness/artifacts/validators, and that correctability is limited by source survival rather than model scale.

Significance. If the results hold, the work identifies a concrete mechanism by which memory can degrade performance below the no-memory baseline, with direct implications for memory architectures in conversational and agentic systems. Strengths include the judge-free exact-match scoring against external ground truth, matched-budget controls that rule out length artifacts, released harness and paired conditions, and the observation that an 8B model and frontier model are bottlenecked at the same point by source survival. The replication on deployed systems and real dialogue adds practical relevance. The conditional nature of the source-first policy (compact/identifiable sources) limits the generality of the constructive claim but does not undermine the core empirical finding on lossy vs. empty memory.

major comments (1)
  1. [Abstract] The claim that the source-first policy 'restores correctability at equal budget' is explicitly qualified by the precondition 'where the answer-determining source is compact and identifiable.' Yet the experiments (controlled study, MultiWOZ) appear to operate only in regimes where sources are already short and isolatable. The matched-budget control therefore does not demonstrate that the policy works when this precondition is absent, and that gap is load-bearing for the reported 0.49-0.88 reclamation numbers and the 'one-line' policy's advertised advantage.
minor comments (1)
  1. [Abstract] The phrase 'a located boundary past which the fix fails silently unless the note records its completeness' is introduced without a precise definition or measurement protocol. A short operational definition, or a pointer to the relevant section, would improve clarity.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the careful and constructive review. We address the single major comment below.

Point-by-point responses
  1. Referee: [Abstract] The claim that the source-first policy 'restores correctability at equal budget' is explicitly qualified by the precondition 'where the answer-determining source is compact and identifiable.' Yet the experiments (controlled study, MultiWOZ) appear to operate only in regimes where sources are already short and isolatable. The matched-budget control therefore does not demonstrate that the policy works when this precondition is absent, and that gap is load-bearing for the reported 0.49-0.88 reclamation numbers and the 'one-line' policy's advertised advantage.

    Authors: The abstract and body text explicitly qualify the source-first policy with the precondition that the answer-determining source is compact and identifiable; we do not claim the policy works when this precondition is absent. The controlled study and MultiWOZ experiments are conducted precisely in regimes satisfying the precondition, which are the settings in which the policy is intended to apply. The matched-budget controls therefore demonstrate the policy's effect under the stated conditions, and the 0.49-0.88 reclamation figures are reported only for those regimes. The paper already notes boundary conditions where the fix fails unless completeness is recorded. We stand by the qualified claim and do not interpret the referee's observation as requiring removal of the qualification or expansion of the tested regimes.

    Revision: no

Circularity Check

0 steps flagged

No significant circularity; empirical claims grounded in external benchmarks and released controls

Full rationale

The paper presents an empirical study using judge-free exact scoring against ground truth, matched-budget controls, and a released harness across eight models. No derivation chain, equations, or load-bearing claims reduce by construction to self-definitions, fitted inputs renamed as predictions, or self-citation. The source-first policy is introduced as a one-line heuristic and tested under explicitly stated conditions (compact identifiable sources), with failure boundaries noted; these are not tautological. Central results (lossy memory never better than empty) are falsifiable via the released validators and do not rely on internal redefinitions. This is self-contained against external benchmarks, consistent with the default non-circular outcome for controlled empirical work.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Based solely on the abstract, the work is an empirical evaluation with no mentioned free parameters, axioms, or invented entities; the claims rest on experimental controls and released artifacts.

pith-pipeline@v0.9.1-grok · 5809 in / 1212 out tokens · 51694 ms · 2026-06-30T10:18:53.717582+00:00 · methodology


Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Manufactured Confidence: How Memory Consolidation Turns Hearsay into Confident Facts

    cs.CR 2026-06 unverdicted novelty 5.0

    LLM memory consolidation turns casual hedged statements into confident facts that agents obey regardless of source or verification.

Reference graph

Works this paper leans on

12 extracted references · 7 canonical work pages · cited by 1 Pith paper · 6 internal anchors

  1. [1]

    Claude Opus 4.8

    Anthropic. Claude Opus 4.8. https://www.anthropic.com/news/claude-opus-4-8, 2026a. Model announcement. Anthropic. Claude Sonnet 4.6. https://www.anthropic.com/news/claude-sonnet-4-6, 2026b. Model announcement. Paweł Budzianowski, Tsung-Hsien Wen, Bo-Hsiang Tseng, Iñigo Casanueva, Stefan Ultes, Osman Ramadan, and Milica Gašić. MultiWOZ – a large-scale mul...

  2. [2]

    Adapting language models to compress contexts

    Alexis Chevalier, Alexander Wettig, Anirudh Ajith, and Danqi Chen. Adapting language models to compress contexts. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  3. [3]

    When Context Flips, Safety Breaks: Diagnosing Brittle Safety in Aligned Language Models

    Dasol Choi and Alex Kwon. When context flips, safety breaks: Diagnosing brittle safety in aligned language models. arXiv preprint arXiv:2605.27851,

  4. [4]

    A Survey on Hallucination in Large Language Models: Principles, Taxonomy, Challenges, and Open Questions

    Lei Huang, Weijiang Yu, Weitao Ma, et al. A survey on hallucination in large language models: Principles, taxonomy, challenges, and open questions. arXiv preprint arXiv:2311.05232,

  5. [5]

    LLMLingua: Compressing prompts for accelerated inference of large language models

    Huiqiang Jiang, Qianhui Wu, Chin-Yew Lin, Yuqing Yang, and Lili Qiu. LLMLingua: Compressing prompts for accelerated inference of large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  6. [6]

    Language Models (Mostly) Know What They Know

    Saurav Kadavath, Tom Conerly, Amanda Askell, et al. Language models (mostly) know what they know. arXiv preprint arXiv:2207.05221,

  7. [7]

    LLMs Get Lost In Multi-Turn Conversation

    Philippe Laban, Hiroaki Hayashi, Yingbo Zhou, and Jennifer Neville. LLMs get lost in multi-turn conversation. arXiv preprint arXiv:2505.06120,

  8. [8]

    MemGPT: Towards LLMs as Operating Systems

    Charles Packer, Sarah Wooders, Kevin Lin, Vivian Fang, Shishir G. Patil, Ion Stoica, and Joseph E. Gonzalez. MemGPT: Towards LLMs as operating systems. arXiv preprint arXiv:2310.08560,

  9. [9]

    C-Pack: Packed Resources For General Chinese Embeddings

    Shitao Xiao, Zheng Liu, Peitian Zhang, and Niklas Muennighoff. C-Pack: Packed resources for general Chinese embeddings. arXiv preprint arXiv:2309.07597,

  10. [10]

    Editing large language models: Problems, methods, and opportunities

    Yunzhi Yao, Peng Wang, Bozhong Tian, Siyuan Cheng, Zhoubo Li, Shumin Deng, Huajun Chen, and Ningyu Zhang. Editing large language models: Problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP),

  11. [11]

    AgentTuning: Enabling generalized agent abilities for LLMs

    Aohan Zeng, Mingdao Liu, Rui Lu, Bowen Wang, Xiao Liu, Yuxiao Dong, and Jie Tang. AgentTuning: Enabling generalized agent abilities for LLMs. arXiv preprint arXiv:2310.12823,

  12. [12]

    worse than empty

    and explicitly not an absolute coverage figure: the two labelers bracket the compact share rather than pinning it. The agreement is two LLMs labeling the same text, not a human gold standard. Human spot-check (extra verification). Because both labelers are LLMs, we add a human anchor. The first author labeled a stratified 51-conversation slice (17 per domain) b...