Auditing Forgetting in Limited Memory Language Models

Arya Raeesi; Hanna Roed

arXiv: 2607.00605 · v1 · pith:5USLJZY5new · submitted 2026-07-01 · cs.CL · cs.AI · cs.LG

Pith reviewed 2026-07-02 13:07 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.LG

keywords limited memory language models · machine unlearning · parametric leakage · retrieval artifacts · causal auditing · database deletion · alias closure · forgetting evaluation

The pith

Deleted facts in limited memory language models persist almost exclusively through retrieval artifacts rather than parametric memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a causal auditing framework for limited memory language models that store knowledge in an external database. It varies the database state at inference time while holding the model fixed, across three conditions that separate parametric effects from retrieval. Tests on over twelve thousand alias-closure deletions in multiple domains and adversarial database topologies show parametric leakage near zero in every case. Remaining post-deletion performance matches the rate of retrieval artifacts from near-neighbor entries, not residual internal memory. This matters because it shows unlearning success is controlled by how the database is managed rather than by changes inside the model.

Core claim

By holding the model fixed and varying the database state at inference across FULL, DEL-ON, and DEL-OFF conditions, the audit decomposes post-deletion performance into parametric leakage, retrieval-mediated correctness, and a retrieval artifact rate. Across all tested variants and prompts, parametric leakage stays near zero while retrieval-mediated correctness matches the artifact rate, indicating that surviving knowledge is reconstituted from near-neighbor retrieval in the edited database. This residual ranges from 0.7% on the released LMLM database to 13.6% on the most adversarial variant.

What carries the argument

Causal auditing framework that decomposes post-deletion behavior into parametric leakage L(f), retrieval-mediated correctness R(f), and retrieval artifact rate by comparing FULL, DEL-ON, and DEL-OFF database states at inference time.
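
The decomposition can be illustrated with a minimal Python sketch. The page does not give formal definitions of L(f), R(f), or the artifact rate, so the definitions below are assumptions chosen so that L(f) and R(f) stack to DEL-ON correctness, as the figure captions describe. The `Outcome` record and the `decompose` function are hypothetical names, not the authors' code, and the FULL condition is left out because it only serves as the pre-deletion baseline.

```python
from dataclasses import dataclass
from typing import Sequence


@dataclass(frozen=True)
class Outcome:
    """Correctness flags for one deleted fact (hypothetical record layout)."""
    del_on: bool    # deleted answer returned with retrieval enabled
    del_off: bool   # deleted answer returned with retrieval disabled
    artifact: bool  # retrieval trace surfaced a near-neighbor entry


def decompose(outcomes: Sequence[Outcome]) -> dict[str, float]:
    """Split DEL-ON correctness into assumed L(f) and R(f) shares."""
    n = len(outcomes)
    if n == 0:
        raise ValueError("need at least one audited deletion")
    # Assumption: leakage = correct under DEL-ON and still correct without retrieval.
    leakage = sum(o.del_on and o.del_off for o in outcomes) / n
    # Assumption: retrieval-mediated = correct under DEL-ON only.
    mediated = sum(o.del_on and not o.del_off for o in outcomes) / n
    # Assumption: artifact rate = correct under DEL-ON with a near-neighbor hit in the trace.
    artifact = sum(o.del_on and o.artifact for o in outcomes) / n
    return {
        "L": leakage,
        "R": mediated,
        "artifact_rate": artifact,
        "del_on": leakage + mediated,
    }


# Toy data, not taken from the paper: ten deletions, three survive under DEL-ON.
toy_audit = (
    [Outcome(del_on=True, del_off=True, artifact=False)]
    + [Outcome(del_on=True, del_off=False, artifact=True)] * 2
    + [Outcome(del_on=False, del_off=False, artifact=False)] * 7
)
print(decompose(toy_audit))
```

Under these assumed definitions, the paper's reported pattern would show up as `L` close to zero and `R` close to `artifact_rate`.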

If this is right

Parametric leakage remains near zero across all prompt styles and database topologies.
Retrieval-mediated correctness matches the retrieval artifact rate within rounding.
Post-deletion correctness is reconstituted from near-neighbor retrieval rather than residual parametric memory.
The unlearning boundary is set by database administration rather than model parameters.
Prompt formulation does not independently control how much of a deleted fact survives.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Database designers could lower residual leakage by reducing alias density or collision probability in the retrieval graph.
The same auditing method could check whether forgetting is genuine in other retrieval-augmented models.
Adversarial database topologies may require more than simple deletion to achieve low residual survival.
Security against recovery of deleted facts in LMLMs depends on controlling the structure of the external memory.

Load-bearing premise

The three interventions correctly isolate parametric memory from retrieval effects without introducing new artifacts from the specific database topologies or prompt formulations.

What would settle it

A prompt or database variant where the model returns the deleted answer at a rate well above zero in the DEL-OFF condition, where retrieval is disabled, would indicate meaningful parametric leakage.

Figures

Figures reproduced from arXiv: 2607.00605 by Arya Raeesi, Hanna Roed.

**Figure 1.** Comparison of a standard retrieval-augmented language model (LLM + RAG) and an LMLM. Both architectures pair a parametric model with an external database, but LMLMs are pretrained to limit the internal storage of factual knowledge, so factual recall is routed through the external store rather than reconstructed from parameters (Zhao et al., 2025). However, it remains unclear whether deletion in these syst…

**Figure 2.** Outlines this pipeline.

**Figure 3.** Attribution of DEL-ON correctness by database variant. The stacked left bar in each pair decomposes DEL-ON correctness into parametric leakage L(f) (red, L̂) and retrieval-mediated correctness R(f) (green); the orange bar shows the retrieval artifact rate. Parametric leakage is near zero in every variant. The combined L(f) + R(f) stack is 3.7% for Base, 6.2% for Alias, 7.2% for Collision, 13.6% for Noise…

**Figure 4.** Attribution of DEL-ON correctness by prompt style. The left bar in each pair stacks parametric leakage L(f) (red, L̂) and retrieval-mediated correctness R(f) (green); the orange bar shows the retrieval artifact rate. Leakage is near zero across prompt styles, with a small residue under direct and few-shot. The L(f) + R(f) stack is 9.9% (direct), 7.9% (contextual), 8.7% (paraphrased), 6.1% (cloze), 8.7% (co…

**Figure 5.** Weighted token F1 by prompt style and intervention state. FULL F1 is ∼41% for direct, ∼44% for contextual, ∼34% for paraphrased, ∼26% for cloze, ∼57% for continuations, and ∼17% for few-shot prompts. DEL-ON F1 sits between ∼7% and ∼10% across all six prompt styles. DEL-OFF F1 is near zero in every prompt style.

**Figure 6.** Weighted exact match by database variant and intervention state. FULL exact match is ∼33% on Base, Alias, and Noise, and ∼46% on Collision and the released LMLM database. DEL-ON exact match is ∼3% on Base, ∼6% on Alias, ∼7% on Collision, ∼14% on Noise, and ∼1% on the released LMLM database. DEL-OFF exact match is near zero in every variant. Tables 5 and 6 in Appendix E report the full per-cell numbers und…

**Figure 7.** Discordant-pair counts driving the paired McNemar test for DEL-ON against DEL-OFF, broken out by database variant. Bar height is on a log scale so the much smaller c counts remain legible alongside b. Annotations above each variant give the p-value of the recommended test (χ² with continuity correction when b + c > 25, exact binomial otherwise). Across the four custom variants the asymmetry is unambiguous…
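
The test selection rule in the Figure 7 caption can be sketched in a few lines of standard-library Python. This is an illustrative implementation of the textbook McNemar test, not the authors' code. The function name `mcnemar_p_value` is hypothetical, and the reading of `b` and `c` as the two discordant-pair counts (correct under one condition only) is an assumption, since the truncated caption does not say which count is which.

```python
import math


def mcnemar_p_value(b: int, c: int) -> float:
    """Two-sided McNemar p-value from the two discordant-pair counts.

    Follows the rule quoted in the caption: chi-square with continuity
    correction when b + c > 25, exact binomial otherwise.
    """
    if b < 0 or c < 0:
        raise ValueError("discordant counts must be non-negative")
    n = b + c
    if n == 0:
        return 1.0  # no discordant pairs, so no evidence of asymmetry
    if n > 25:
        # Continuity-corrected statistic, clamped at zero when |b - c| < 1.
        stat = max(abs(b - c) - 1, 0) ** 2 / n
        # Survival function of chi-square with 1 degree of freedom.
        return math.erfc(math.sqrt(stat / 2))
    # Exact two-sided binomial test with success probability 0.5.
    k = min(b, c)
    tail = sum(math.comb(n, i) for i in range(k + 1)) / 2 ** n
    return min(1.0, 2 * tail)


# Toy counts, not taken from the paper.
print(mcnemar_p_value(100, 2))
print(mcnemar_p_value(10, 0))
```

A strongly one-sided pair such as `(100, 2)` yields a very small p-value, which is the kind of asymmetry the caption calls unambiguous.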

Original abstract

Limited Memory Language Models (LMLMs) externalize factual knowledge to a database to enable deletion-based unlearning without retraining. Existing evaluations measure post-deletion correctness in aggregate and cannot tell whether a deleted fact persists through residual parametric memory, alternative retrieval paths, or near-neighbor retrieval artifacts. We propose a causal auditing framework that holds the model fixed and varies the database state at inference time across three interventions: FULL, DEL-ON, and DEL-OFF. The framework decomposes post-deletion behavior into parametric leakage L(f), retrieval-mediated correctness R(f), and a retrieval artifact rate grounded in the inference-time retrieval trace. We apply it to 12,228 alias-closure deletions across thirteen databases, including four adversarial topologies (Base, Alias, Noise, Collision) we construct in three domains, and six prompt formulations. Parametric leakage is near zero in every variant and every prompt style: the model rarely returns the deleted answer in the absence of retrieval. The residual that does survive lives in the retrieval graph: retrieval-mediated correctness and the retrieval artifact rate match within rounding everywhere, so post-deletion correctness is, in our audit, predominantly reconstituted from near-neighbor retrieval. This residual ranges from 0.7% on the released LMLM database to 13.6% on the most adversarial variant, and prompt formulation does not independently control how much of a deleted fact survives. These results suggest that, for this class of LMLM and deletion procedure, the unlearning boundary is drawn primarily by the database administrator rather than by the model.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

The causal auditing framework with the three interventions is new and practical, but the claim of near-zero parametric leakage depends on an untested assumption that DEL-OFF leaves the model's generation behavior unchanged.

The paper introduces a causal auditing method for limited-memory LMs that externalize facts to a database. It runs the same model under three database states at inference time (full, deleted with retrieval on, and deleted with retrieval off), then decomposes the remaining correctness into parametric leakage L(f), retrieval-mediated correctness R(f), and a retrieval artifact rate. That decomposition and the explicit interventions do not appear in the cited prior work.

The experiments cover 12,228 deletions across thirteen databases, including four adversarial topologies, and six prompt styles. The central observation is that L(f) stays near zero in all cases while the residual correctness tracks the artifact rate, so most surviving answers after deletion come from near-neighbor retrieval rather than the model parameters.

The weakest point is the isolation step. DEL-OFF is meant to measure pure parametric memory, but the paper does not report whether removing retrieval access changes refusal rates, fallback strategies, or overall output distribution. If it does, the low L(f) observed under DEL-OFF may not hold when retrieval remains active but the target fact is gone. The abstract gives concrete percentages without error bars or statistical tests, which makes the exact size of the residual harder to assess.

The framework itself is a clear step for anyone who needs to verify deletion in these systems. A reader working on unlearning or privacy-compliant retrieval models would find the setup and the scale of the tests useful even if the numbers require tighter validation.

Send it to referees. The idea is grounded enough and the experiments broad enough that review would strengthen the isolation argument and add the missing statistics.

Referee Report

2 major / 2 minor

Summary. The paper introduces a causal auditing framework for Limited Memory Language Models (LMLMs) that externalize knowledge to a database. It varies database state at inference via FULL, DEL-ON, and DEL-OFF interventions to decompose post-deletion behavior into parametric leakage L(f), retrieval-mediated correctness R(f), and retrieval artifact rate. Applied to 12,228 alias-closure deletions across 13 databases (including four adversarial topologies) and six prompt formulations, it reports parametric leakage near zero in all cases, with residual post-deletion correctness (0.7% to 13.6%) attributable to near-neighbor retrieval artifacts rather than parametric memory.

Significance. If the interventions validly isolate the components without confounding, this provides a useful empirical decomposition showing that unlearning boundaries in this LMLM class are set primarily by database design. The evaluation across adversarial topologies and prompt styles adds robustness. The framework could support future audits of retrieval-augmented models.

major comments (2)

[Auditing framework] Description of the three interventions (auditing framework): The claim that L(f) is near zero rests on DEL-OFF isolating pure parametric memory. However, disabling retrieval may independently alter the model's generation regime (e.g., refusal rates or fallback strategies), so the low rate of returning the deleted answer under DEL-OFF does not necessarily imply the same under enabled retrieval without the target fact. This assumption is load-bearing for the central conclusion and is not validated by direct comparisons of output distributions between conditions.
[Empirical results] Results reporting the leakage and artifact rates (0.7% to 13.6% across variants): The concrete percentages are presented without error bars, confidence intervals, or statistical tests (e.g., tests against zero or equivalence between R(f) and artifact rate). With 12,228 deletions, this omission makes it difficult to assess whether 'near zero' and 'match within rounding' are statistically supported or sensitive to sampling.

minor comments (2)

The notation L(f), R(f) and artifact rate would benefit from an explicit equation or definition box in the main text rather than relying on the abstract and later prose.
A summary table aggregating rates by topology and prompt formulation would improve readability of the multi-condition results.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the careful reading and constructive comments. We address each major point below and indicate where revisions will be made to strengthen the manuscript.

Referee: Description of the three interventions (auditing framework): The claim that L(f) is near zero rests on DEL-OFF isolating pure parametric memory. However, disabling retrieval may independently alter the model's generation regime (e.g., refusal rates or fallback strategies), so the low rate of returning the deleted answer under DEL-OFF does not necessarily imply the same under enabled retrieval without the target fact. This assumption is load-bearing for the central conclusion and is not validated by direct comparisons of output distributions between conditions.

Authors: We agree that the validity of DEL-OFF as an isolator of parametric leakage requires confirmation that the generation regime is not materially altered by disabling retrieval. In the revised manuscript we will add explicit comparisons of output distributions (refusal rates, fallback strategy frequencies, and token-level statistics) between DEL-ON and DEL-OFF conditions on the same prompts. These comparisons will be reported for each database and prompt formulation to substantiate that the low L(f) rates reflect absence of parametric memory rather than a change in generation behavior. revision: yes
Referee: Results reporting the leakage and artifact rates (0.7% to 13.6% across variants): The concrete percentages are presented without error bars, confidence intervals, or statistical tests (e.g., tests against zero or equivalence between R(f) and artifact rate). With 12,228 deletions, this omission makes it difficult to assess whether 'near zero' and 'match within rounding' are statistically supported or sensitive to sampling.

Authors: We acknowledge that the absence of uncertainty quantification and formal tests limits the strength of the claims. In the revision we will report binomial confidence intervals for all rates and add statistical tests (equivalence tests for R(f) versus artifact rate, and one-sided tests against zero for L(f)) using the existing 12,228 observations. These additions will be included in the main results table and the supplementary material. revision: yes
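
As an illustration of the binomial confidence intervals promised in the second response, the sketch below computes a Wilson score interval for a single rate. The choice of the Wilson method is an assumption, since the rebuttal does not name one, and `wilson_interval` is a hypothetical helper. The counts in the usage lines are toy values, not figures from the paper.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (z = 1.96 gives about 95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    if not 0 <= successes <= n:
        raise ValueError("successes must lie between 0 and n")
    p = successes / n
    denom = 1 + z * z / n
    center = (p + z * z / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z * z / (4 * n * n)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, center - half), min(1.0, center + half)


# Toy counts: zero leaked answers out of 1000 audited deletions.
low, high = wilson_interval(0, 1000)
print(f"leakage interval: [{low:.4f}, {high:.4f}]")
```

Unlike the simple normal approximation, the Wilson interval stays informative when the observed rate is exactly zero, which is the regime a "near zero" leakage claim lives in.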

Circularity Check

0 steps flagged

No circularity: empirical decomposition via direct interventions

The paper's central claims rest on operational definitions of L(f), R(f), and artifact rate measured directly from three controlled database-state interventions (FULL, DEL-ON, DEL-OFF) applied to 12,228 deletions across multiple topologies and prompts. No equations, fitted parameters, or self-citations are invoked to derive the reported near-zero parametric leakage; the results are raw empirical counts under each intervention. The decomposition is therefore self-contained against external benchmarks and does not reduce any quantity to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The paper relies on the domain assumption that varying the database at inference time while holding the model fixed isolates parametric memory effects from retrieval effects; no free parameters or invented entities are introduced in the abstract.

axioms (1)

Domain assumption: Varying the database state at inference time while holding the model fixed isolates parametric memory effects from retrieval effects.
This is the core of the causal auditing framework described in the abstract.

Reference graph

Works this paper leans on

13 extracted references · 12 canonical work pages · 6 internal anchors

[1] URL https://arxiv.org/abs/1912.03817.

[2] Carlini, N., Tramer, F., Wallace, E., Jagielski, M., Herbert-Voss, A., Lee, K., Roberts, A., Brown, T., Song, D., Erlingsson, Ú., Oprea, A., and Raffel, C. Extracting training data from large language models. In USENIX Security Symposium. URL https://arxiv.org/abs/2012.07805.

[3] Guu, K., Lee, K., Tung, Z., Pasupat, P., and Chang, M.-W. REALM: Retrieval-augmented language model pre-training. In International Conference on Machine Learning. URL https://arxiv.org/abs/2002.08909.

[4] Karpukhin, V., Oğuz, B., Min, S., Lewis, P., Wu, L., Edunov, S., Chen, D., and Yih, W.-t. Dense passage retrieval for open-domain question answering. In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP). URL https://arxiv.org/abs/2004.04906.

[5] Lewis, P., Perez, E., Piktus, A., Petroni, F., Karpukhin, V., Goyal, N., Küttler, H., Lewis, M., Yih, W.-t., Rocktäschel, T., Riedel, S., and Kiela, D. Retrieval-augmented generation for knowledge-intensive NLP tasks. In Advances in Neural Information Processing Systems. URL https://arxiv.org/abs/2005.11401.

[6] Lizzo, T. and Heck, L. Unlearning in LLMs: Methods, evaluation, and open challenges. URL https://arxiv.org/abs/2601.13264.

[7] Maini, P., Feng, Z., Schwarzschild, A., Lipton, Z. C., and Kolter, J. Z. TOFU: A task of fictitious unlearning for LLMs. URL https://arxiv.org/abs/2401.06121.

[8] Mallen, A., Asai, A., Zhong, V., Das, R., Khashabi, D., and Hajishirzi, H. When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (ACL). URL https://arxiv.org/abs/2212.10511.

[9] Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems. URL https://arxiv.org/abs/2202.05262.

[10] Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. Mass-editing memory in a transformer. In International Conference on Learning Representations. URL https://arxiv.org/abs/2210.07229.

[11] Min, S., Krishna, K., Lyu, X., Lewis, M., Yih, W.-t., Koh, P. W., Iyyer, M., Zettlemoyer, L., and Hajishirzi, H. FActScore: Fine-grained atomic evaluation of factual precision in long form text generation. In Empirical Methods in Natural Language Processing (EMNLP). URL https://arxiv.org/abs/2305.14251.

[12] Yao, Y., Wang, P., Tian, B., Cheng, S., Li, Z., Deng, S., Chen, H., and Zhang, N. Editing large language models: Problems, methods, and opportunities. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing (EMNLP). URL https://arxiv.org/abs/2305.13172.

[13] Zhao, L. and contributors. LMLM. https://github.com/kilian-group/LMLM. URL https://arxiv.org/abs/2505.15962.

A. Architecture. This appendix details the two design axes that define the prompt × database grid used throughout the audit: the six prompt formulations applied to every target fact, and the four custom database topologies constructed to stress-test alias-closure ...
