Output Vector Editing for Memorization Mitigation in Large Language Models

Ahmad Dawar Hakimi; Hinrich Sch\"utze; Isabelle Augenstein; Kaiwei Lei

arxiv: 2606.18767 · v1 · pith:DOVJRYV5new · submitted 2026-06-17 · 💻 cs.CL

Output Vector Editing for Memorization Mitigation in Large Language Models

Ahmad Dawar Hakimi , Kaiwei Lei , Isabelle Augenstein , Hinrich Sch\"utze This is my paper

Pith reviewed 2026-06-26 21:21 UTC · model grok-4.3

classification 💻 cs.CL

keywords memorization mitigationlarge language modelsoutput vector editingMLP neuronsresidual streamneuron localizationconstrained optimization

0 comments

The pith

Output vector editing on MLP neurons suppresses up to 87.9% of memorized sequences by redirecting residual stream contributions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper proposes output vector editing to mitigate memorization in large language models. It locates a small set of MLP neurons responsible for a memorized continuation and applies constrained optimization to minimally change their output vectors, introducing a distractor in vocabulary space that redirects residual stream contributions without altering activations. This produces up to 87.9% suppression on OLMo-7B and a 2.7 times larger effect than zero ablation on the same neurons. The method transfers across four models from 360M to 7B parameters, with success rates increasing with model size. It also identifies a boundary of about 14% of sequences that resist MLP-only editing.

Core claim

A constrained-optimization weight edit locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged, achieving up to 87.9% suppression with a 2.7 times gap over zero ablation on the same located neurons.

What carries the argument

Output vector editing: a constrained-optimization procedure that identifies responsible MLP neurons and edits their output vectors to redirect contributions to the residual stream.

If this is right

Four edit modes span aggressive suppression to minimal redirection, with their ensemble covering 96.5% of memorized sequences.
A single recommended mode reaches 81.5% success with no catastrophic locality failures.
Success rates scale with model size rather than model family across 360M to 7B parameters.
About 14% of sequences remain unreachable by MLP-only editing, though attention ablation recovers 60-64% of them.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Attention mechanisms could serve as a complementary fallback for the 14% of sequences that resist MLP editing.
The same output-vector approach might apply to other unwanted model behaviors such as generating toxic text.
Similar edits performed at inference time could avoid permanent changes to model weights.

Load-bearing premise

The neuron localization step reliably identifies the responsible neurons without post-hoc selection inflating the reported gap over zero ablation.

What would settle it

Zero ablation on the same located neurons produces suppression rates within 10-15% of the output vector edit rates, or attention head ablation fails to recover 60-64% of the sequences unreachable by MLP editing.

Figures

Figures reproduced from arXiv: 2606.18767 by Ahmad Dawar Hakimi, Hinrich Sch\"utze, Isabelle Augenstein, Kaiwei Lei.

**Figure 2.** Figure 2: Success–locality trade-off on OLMo-7B; lower-right is better (high success, low median [PITH_FULL_IMAGE:figures/full_fig_p007_2.png] view at source ↗

**Figure 3.** Figure 3: Annotation prompt used for LLM-based failure mode classification (Claude Opus) of output [PITH_FULL_IMAGE:figures/full_fig_p023_3.png] view at source ↗

**Figure 4.** Figure 4: Generated next token under activation patching of neuron L7N10794 across [PITH_FULL_IMAGE:figures/full_fig_p025_4.png] view at source ↗

read the original abstract

Large language models memorize and reproduce sequences from their training data, creating privacy, copyright, and security risks. Existing neuron-level mitigation methods equate editing with zeroing out neuron activations, but the activation only controls whether a neuron engages; the output vector is what writes to the residual stream and, through superposition, encodes multiple features. We propose output vector editing, a constrained-optimization weight edit that locates a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space, redirecting their residual-stream contributions while leaving activations unchanged. Evaluating on four models from 360M to 7B parameters (SmolLM-360M, OLMo-1B, OLMo-7B, Llama2-7B), we center on OLMo-7B (whose open weights and pretraining corpus enable systematic mining) and mine 6831 memorized sequences, achieving up to 87.9% suppression. The 2.7$\times$ gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone. Four edit modes span a spectrum from aggressive suppression to minimal redirection; in ensemble they cover 96.5% of memorized sequences, while our recommended single-mode configuration reaches 81.5% with no catastrophic locality failures. We further identify a mechanistic boundary at ${\sim}14%$ of sequences unreachable by MLP-only editing; while these failures are not attention-driven overall, ablating the top contributing attention heads recovers 60--64% of them, with stronger recovery on continuations that copy tokens from the prefix, positioning attention as a complementary fallback rather than a primary mechanism. Edit mode ordering and the success-locality trade-off transfer across all four models, with success rates scaling with model size rather than family.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

Output vector editing gives a measurable edge over zero ablation for memorization suppression, but the paper needs to lock down the neuron localization procedure to make the comparison convincing.

read the letter

The main thing here is a practical edit method that suppresses memorized sequences by tweaking the output vectors of a handful of MLP neurons instead of zeroing their activations. They locate the neurons, then run constrained optimization to add a distractor in vocabulary space so the residual-stream contribution gets redirected while the activation stays the same. On OLMo-7B with 6831 mined sequences this reaches 87.9% suppression and shows a 2.7x gap over zero ablation on the exact same neurons. The gap is the key evidence that the output-vector change itself matters, not just finding the right neurons.

What the paper does well is run the same edit modes across four models from 360M to 7B, show that success and the locality trade-off transfer, and identify the roughly 14% of sequences that MLP edits miss, where attention ablation recovers 60-64%. The ensemble hitting 96.5% and the single-mode 81.5% with no big side effects are useful numbers for anyone trying to deploy this.

The soft spots are mostly around missing details. The abstract gives no error bars, no description of how the 6831 sequences were filtered, and no explicit statement that neuron selection was fixed before seeing edit results. The stress-test point lands: if the localization step uses gradients or importance scores that correlate with how easy a neuron is to redirect, the 2.7x advantage could be partly an artifact. The full paper needs to show the localization criterion was pre-specified and independent.

This is for people working on memorization mitigation and LLM safety. A reader who needs a concrete, scalable edit technique will find usable results and the cross-model checks. It is worth sending to peer review because the core distinction between activation and output vector is substantive and the empirical gains are reported clearly, even though the methods section will need tightening on selection and variance.

Referee Report

3 major / 2 minor

Summary. The paper proposes output vector editing, a constrained-optimization weight edit that identifies a small set of MLP neurons responsible for a memorized continuation and minimally modifies their output vectors to introduce a distractor in vocabulary space. This redirects residual-stream contributions while leaving activations unchanged. On OLMo-7B (and three other models), it reports up to 87.9% suppression on 6831 mined sequences, a 2.7× gap over zero ablation on the same neurons, 96.5% coverage via an ensemble of four edit modes, and a ~14% set of sequences better addressed by attention ablation as a complementary mechanism. The method is claimed to transfer across model sizes with success rates scaling with scale.

Significance. If the localization procedure is pre-specified and independent of edit outcomes, the work provides a targeted, activation-preserving alternative to zero-ablation for memorization mitigation that could improve locality-effectiveness trade-offs. The cross-model evaluation, explicit comparison to zero ablation, and identification of an MLP/attention boundary constitute concrete empirical contributions. The absence of error bars, sequence-mining details, and an explicit independence guarantee for neuron selection, however, limits the strength of the central attribution claim.

major comments (3)

[Abstract] Abstract: The central claim that 'the 2.7× gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone' is load-bearing. The manuscript provides no explicit statement or procedure guaranteeing that neuron localization (via importance scores, gradients, or thresholds) was performed and fixed independently of observing edit success; post-hoc selection of neurons whose output vectors are easy to redirect would artifactually inflate the gap.
[Abstract] Abstract: The evaluation rests on 6831 mined memorized sequences, yet no description is given of the mining/filtering algorithm, the exact criterion for deeming a continuation 'memorized,' the search procedure over the pretraining corpus, or any exclusion rules. This directly affects reproducibility of the 87.9% suppression figure and the claimed generality.
[Abstract] Abstract: Key quantitative results (87.9% suppression, 2.7× gap, 96.5% ensemble coverage, 81.5% single-mode) are reported without error bars, standard deviations across runs or sequences, or statistical significance tests, making it impossible to assess whether the reported advantage over zero ablation is reliable.

minor comments (2)

[Abstract] Abstract: The four edit modes are referenced as spanning 'aggressive suppression to minimal redirection' but are neither named nor briefly characterized, hindering reader understanding of the spectrum and the recommended single-mode configuration.
The abstract states that edit-mode ordering and success-locality trade-offs transfer across models, but provides no quantitative transfer metrics or tables; a short summary table would improve clarity.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. The three major comments identify important gaps in documentation that affect reproducibility and the strength of our central claims. We address each point below and commit to revisions that will incorporate the requested details without altering the core methodology or results.

read point-by-point responses

Referee: [Abstract] Abstract: The central claim that 'the 2.7× gap over zero ablation on the same located neurons shows the suppression comes from the output-vector edit, not localization alone' is load-bearing. The manuscript provides no explicit statement or procedure guaranteeing that neuron localization (via importance scores, gradients, or thresholds) was performed and fixed independently of observing edit success; post-hoc selection of neurons whose output vectors are easy to redirect would artifactually inflate the gap.

Authors: We agree that an explicit independence guarantee is required to support the claim. Neuron localization was performed using a fixed, pre-specified gradient-based importance scoring procedure applied to the loss on the target continuation; this scoring and threshold selection occurred for all sequences prior to any editing experiments, with the same neuron sets then used for both zero ablation and output vector editing. No post-hoc filtering based on edit success was applied. We will add a dedicated Methods subsection with a step-by-step description, pseudocode, and a diagram illustrating the separation between localization and editing phases. revision: yes
Referee: [Abstract] Abstract: The evaluation rests on 6831 mined memorized sequences, yet no description is given of the mining/filtering algorithm, the exact criterion for deeming a continuation 'memorized,' the search procedure over the pretraining corpus, or any exclusion rules. This directly affects reproducibility of the 87.9% suppression figure and the claimed generality.

Authors: We will include a complete description of the mining procedure in the revised manuscript. This will cover the search algorithm over the pretraining corpus, the exact memorization criterion (exact continuation match), all filtering steps, and exclusion rules. The details will appear in the Experimental Setup section with pseudocode to enable full reproducibility of the 6831-sequence set. revision: yes
Referee: [Abstract] Abstract: Key quantitative results (87.9% suppression, 2.7× gap, 96.5% ensemble coverage, 81.5% single-mode) are reported without error bars, standard deviations across runs or sequences, or statistical significance tests, making it impossible to assess whether the reported advantage over zero ablation is reliable.

Authors: We agree that statistical reporting is necessary. In the revision we will add standard deviations across the 6831 sequences (and across edit modes where relevant) to all reported figures and tables, include error bars in the main result plots, and report the outcome of a paired statistical test (Wilcoxon signed-rank) on the per-sequence suppression rates to establish the reliability of the 2.7× gap. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical comparison to zero ablation provides independent benchmark

full rationale

The paper's central result is an empirical demonstration that output-vector editing on located neurons achieves 2.7× the suppression of zero ablation on the identical neurons. This comparison is external to the editing procedure itself and does not reduce any claimed prediction or uniqueness result to a fitted parameter or self-citation by construction. No equations, ansatzes, or localization criteria are shown to be defined in terms of the final suppression metric, and the method is evaluated across multiple models with reported transfer of ordering and trade-offs. The derivation chain therefore remains self-contained against the stated benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 0 axioms · 0 invented entities

Empirical ML safety paper; no mathematical axioms or invented physical entities. Free parameters are implicit in neuron selection thresholds, edit-mode hyperparameters, and the definition of 'memorized sequence'.

free parameters (2)

neuron selection threshold or count
Paper selects 'a small set of MLP neurons' responsible for each memorized continuation; the criterion is not stated in the abstract and is likely tuned or thresholded.
distractor strength or optimization constraint
Constrained-optimization edit whose exact regularization or step-size parameters are not reported.

pith-pipeline@v0.9.1-grok · 5878 in / 1432 out tokens · 32299 ms · 2026-06-26T21:21:29.299337+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

51 extracted references · 29 canonical work pages · 2 internal anchors

[1]

Extracting Training Data from Large Language Models , booktitle =

Nicholas Carlini and Florian Tram. Extracting Training Data from Large Language Models , booktitle =. 2021 , url =

2021
[2]

Quantifying Memorization Across Neural Language Models , booktitle =

Nicholas Carlini and Daphne Ippolito and Matthew Jagielski and Katherine Lee and Florian Tram. Quantifying Memorization Across Neural Language Models , booktitle =. 2023 , url =

2023
[3]

CoRR , volume =

Valentin Hartmann and Anshuman Suri and Vincent Bindschaedler and David Evans and Shruti Tople and Robert West , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2310.18362 , eprinttype =. 2310.18362 , timestamp =

work page doi:10.48550/arxiv.2310.18362 2023
[4]

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats , booktitle =

Leo Schwinn and David Dobre and Stephan G. Adversarial Attacks and Defenses in Large Language Models: Old and New Threats , booktitle =. 2023 , url =

2023
[5]

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N

Katherine Lee and Daphne Ippolito and Andrew Nystrom and Chiyuan Zhang and Douglas Eck and Chris Callison. Deduplicating Training Data Makes Language Models Better , booktitle =. 2022 , url =. doi:10.18653/V1/2022.ACL-LONG.577 , timestamp =

work page doi:10.18653/v1/2022.acl-long.577 2022
[6]

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs , booktitle =

Abhimanyu Hans and John Kirchenbauer and Yuxin Wen and Neel Jain and Hamid Kazemi and Prajwal Singhania and Siddharth Singh and Gowthami Somepalli and Jonas Geiping and Abhinav Bhatele and Tom Goldstein , editor =. Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs , booktitle =. 2024 , url =

2024
[7]

CoRR , volume =

Alexander Xiong and Xuandong Zhao and Aneesh Pappu and Dawn Song , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.05578 , eprinttype =. 2507.05578 , timestamp =

work page doi:10.48550/arxiv.2507.05578 2025
[8]

Knowledge Neurons in Pretrained Transformers

Damai Dai and Li Dong and Yaru Hao and Zhifang Sui and Baobao Chang and Furu Wei , editor =. Knowledge Neurons in Pretrained Transformers , booktitle =. 2022 , url =. doi:10.18653/V1/2022.ACL-LONG.581 , timestamp =

work page doi:10.18653/v1/2022.acl-long.581 2022
[9]

Locating and Editing Factual Associations in

Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov , editor =. Locating and Editing Factual Associations in. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 , year =

2022
[10]

Andonian and Yonatan Belinkov and David Bau , title =

Kevin Meng and Arnab Sen Sharma and Alex J. Andonian and Yonatan Belinkov and David Bau , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

2023
[11]

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva and Roei Schuster and Jonathan Berant and Omer Levy , editor =. Transformer Feed-Forward Layers Are Key-Value Memories , booktitle =. 2021 , url =. doi:10.18653/V1/2021.EMNLP-MAIN.446 , timestamp =

work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021
[12]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham and Aidan Ewart and Logan Riggs Smith and Robert Huben and Lee Sharkey , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2309.08600 , eprinttype =. 2309.08600 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.08600 2023
[13]

Information Flow Routes: Automatically Interpreting Language Models at Scale , booktitle =

Javier Ferrando and Elena Voita , editor =. Information Flow Routes: Automatically Interpreting Language Models at Scale , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.965 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.965 2024
[14]

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs

Peter Hase and Mohit Bansal and Been Kim and Asma Ghandeharioun , editor =. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models , booktitle =. 2023 , url =

2023
[15]

The Twelfth International Conference on Learning Representations,

Jingcheng Niu and Andrew Liu and Zining Zhu and Gerald Penn , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024
[16]

The butterfly effect of model editing: Few edits can trigger large language models collapse

Wanli Yang and Fei Sun and Xinyu Ma and Xun Liu and Dawei Yin and Xueqi Cheng , editor =. The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse , booktitle =. 2024 , url =. doi:10.18653/V1/2024.FINDINGS-ACL.322 , timestamp =

work page doi:10.18653/v1/2024.findings-acl.322 2024
[17]

Demystifying Verbatim Memorization in Large Language Models , booktitle =

Jing Huang and Diyi Yang and Christopher Potts , editor =. Demystifying Verbatim Memorization in Large Language Models , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.598 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.598 2024
[18]

CoRR , volume =

Ali Satvaty and Suzan Verberne and Fatih Turkmen , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2410.02650 , eprinttype =. 2410.02650 , timestamp =

work page doi:10.48550/arxiv.2410.02650 2024
[19]

Deduplicating Training Data Mitigates Privacy Risks in Language Models , booktitle =

Nikhil Kandpal and Eric Wallace and Colin Raffel , editor =. Deduplicating Training Data Mitigates Privacy Risks in Language Models , booktitle =. 2022 , url =

2022
[20]

Copyright Traps for Large Language Models , booktitle =

Matthieu Meeus and Igor Shilov and Manuel Faysse and Yves. Copyright Traps for Large Language Models , booktitle =. 2024 , url =

2024
[21]

Understanding Transformer Memorization Recall Through Idioms

Adi Haviv and Ido Cohen and Jacob Gidron and Roei Schuster and Yoav Goldberg and Mor Geva , editor =. Understanding Transformer Memorization Recall Through Idioms , booktitle =. 2023 , url =. doi:10.18653/V1/2023.EACL-MAIN.19 , timestamp =

work page doi:10.18653/v1/2023.eacl-main.19 2023
[22]

CoRR , volume =

Niklas Stoehr and Mitchell Gordon and Chiyuan Zhang and Owen Lewis , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2403.19851 , eprinttype =. 2403.19851 , timestamp =

work page doi:10.48550/arxiv.2403.19851 2024
[23]

CoRR , volume =

Ilya Lasy and Peter Knees and Stefan Woltran , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.21588 , eprinttype =. 2506.21588 , timestamp =

work page doi:10.48550/arxiv.2506.21588 2025
[24]

Zico Kolter and Chiyuan Zhang , editor =

Pratyush Maini and Michael Curtis Mozer and Hanie Sedghi and Zachary Chase Lipton and J. Zico Kolter and Chiyuan Zhang , editor =. Can Neural Network Memorization Be Localized? , booktitle =. 2023 , url =

2023
[25]

Do Localization Methods Actually Localize Memorized Data in LLM s? A Tale of Two Benchmarks

Ting. Do Localization Methods Actually Localize Memorized Data in. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.176 , timestamp =

work page doi:10.18653/v1/2024.naacl-long.176 2024
[26]

CoRR , volume =

Manan Suri and Nishit Anand and Amisha Bhaskar , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2503.06040 , eprinttype =. 2503.06040 , timestamp =

work page doi:10.48550/arxiv.2503.06040 2025
[27]

Hudson and Caleb Geniesse and Kyle Chard and Yaoqing Yang and Ian T

Mansi Sakarvadia and Aswathy Ajith and Arham Mushtaq Khan and Nathaniel C. Hudson and Caleb Geniesse and Kyle Chard and Yaoqing Yang and Ian T. Foster and Michael W. Mahoney , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025
[28]

Fortieth

Zhenliang Zhang and Xinyu Hu and Xiaojun Wan , editor =. Fortieth. 2026 , url =. doi:10.1609/AAAI.V40I41.40793 , timestamp =

work page doi:10.1609/aaai.v40i41.40793 2026
[29]

Dirk Groeneveld and Iz Beltagy and Evan Pete Walsh and Akshita Bhagia and Rodney Kinney and Oyvind Tafjord and Ananya Harsh Jha and Hamish Ivison and Ian Magnusson and Yizhong Wang and Shane Arora and David Atkinson and Russell Authur and Khyathi Raghavi Chandu and Arman Cohan and Jennifer Dumas and Yanai Elazar and Yuling Gu and Jack Hessel and Tushar Kh...

work page doi:10.18653/v1/2024.acl-long.841 2024
[30]

Transformer Circuits Thread , volume=

A mathematical framework for transformer circuits , author=. Transformer Circuits Thread , volume=
[31]

Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

Mor Geva and Avi Caciularu and Kevin Ro Wang and Yoav Goldberg , editor =. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space , booktitle =. 2022 , url =. doi:10.18653/V1/2022.EMNLP-MAIN.3 , timestamp =

work page doi:10.18653/v1/2022.emnlp-main.3 2022
[32]

CommonGen:

Bill Yuchen Lin and Wangchunshu Zhou and Ming Shen and Pei Zhou and Chandra Bhagavatula and Yejin Choi and Xiang Ren , editor =. CommonGen:. Findings of the Association for Computational Linguistics:. 2020 , url =. doi:10.18653/V1/2020.FINDINGS-EMNLP.165 , timestamp =

work page doi:10.18653/v1/2020.findings-emnlp.165 2020
[33]

Low-Perplexity

Arthur Wuhrmann and Andrei Kucharavy and Anastasiia Kucherenko , editor =. Low-Perplexity. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop),. 2025 , url =. doi:10.18653/V1/2025.ACL-SRW.51 , timestamp =

work page doi:10.18653/v1/2025.acl-srw.51 2025
[34]

CoRR , volume =

Andr. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.25941 , eprinttype =. 2510.25941 , timestamp =

work page doi:10.48550/arxiv.2510.25941 2025
[35]

Model editing at scale leads to gradual and catastrophic forgetting

Akshat Gupta and Anurag Rao and Gopala Anumanchipalli , editor =. Model Editing at Scale leads to Gradual and Catastrophic Forgetting , booktitle =. 2024 , url =. doi:10.18653/V1/2024.FINDINGS-ACL.902 , timestamp =

work page doi:10.18653/v1/2024.findings-acl.902 2024
[36]

MicroEdit: Neuron-level Knowledge Disentanglement and Localization in Lifelong Model Editing , booktitle =

Shiqi Wang and Qi Wang and Runliang Niu and He Kong and Yi Chang , editor =. MicroEdit: Neuron-level Knowledge Disentanglement and Localization in Lifelong Model Editing , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.1719 , timestamp =

work page doi:10.18653/v1/2025.emnlp-main.1719 2025
[37]

GeoEdit: Geometric Knowledge Editing for Large Language Models , booktitle =

Yujie Feng and Li. GeoEdit: Geometric Knowledge Editing for Large Language Models , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.676 , timestamp =

work page doi:10.18653/v1/2025.emnlp-main.676 2025
[38]

Alaa and Thomas Hartvigsen and Gopala Anumanchipalli , editor =

Akshat Gupta and Phudish Prateepamornkul and Maochuan Lu and Ahmed M. Alaa and Thomas Hartvigsen and Gopala Anumanchipalli , editor =. Lifelong Knowledge Editing requires Better Regularization , booktitle =. 2025 , url =

2025
[39]

Wes Gurnee and Neel Nanda and Matthew Pauly and Katherine Harvey and Dmitrii Troitskii and Dimitris Bertsimas , title =. Trans. Mach. Learn. Res. , volume =. 2023 , url =

2023
[40]

Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models , booktitle =

Zheyuan Liu and Guangyao Dou and Xiangchi Yuan and Chunhui Zhang and Zhaoxuan Tan and Meng Jiang , editor =. Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models , booktitle =. 2025 , url =

2025
[41]

Wu and Virginia Smith , title =

Shengyuan Hu and Yiwei Fu and Steven Z. Wu and Virginia Smith , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025
[42]

Large Language Model Unlearning , booktitle =

Yuanshun Yao and Xiaojun Xu and Yang Liu , editor =. Large Language Model Unlearning , booktitle =. 2024 , url =

2024
[43]

2022 , url =

Neel Nanda and Joseph Bloom , title =. 2022 , url =

2022
[44]

HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , editor =. HellaSwag: Can a Machine Really Finish Your Sentence? , booktitle =. 2019 , url =. doi:10.18653/V1/P19-1472 , timestamp =

work page doi:10.18653/v1/p19-1472 2019
[45]

MEMIT-Merge: Addressing

Zilu Dong and Xiangqing Shen and Rui Xia , editor =. MEMIT-Merge: Addressing. Findings of the Association for Computational Linguistics,. 2025 , url =

2025
[46]

Interpreting

Nostalgebraist , journal=. Interpreting
[47]

Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy , booktitle =

Daphne Ippolito and Florian Tram. Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy , booktitle =. 2023 , url =. doi:10.18653/V1/2023.INLG-MAIN.3 , timestamp =

work page doi:10.18653/v1/2023.inlg-main.3 2023
[48]

Kassem and Omar Mahmoud and Sherif Saad , editor =

Aly M. Kassem and Omar Mahmoud and Sherif Saad , editor =. Preserving Privacy Through Dememorization: An Unlearning Technique For Mitigating Memorization Risks In Language Models , booktitle =. 2023 , url =. doi:10.18653/V1/2023.EMNLP-MAIN.265 , timestamp =

work page doi:10.18653/v1/2023.emnlp-main.265 2023
[49]

CoRR , volume =

Tong Chen and Faeze Brahman and Jiacheng Liu and Niloofar Mireshghallah and Weijia Shi and Pang Wei Koh and Luke Zettlemoyer and Hannaneh Hajishirzi , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.14452 , eprinttype =. 2504.14452 , timestamp =

work page doi:10.48550/arxiv.2504.14452 2025
[50]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and Russell Authur and Ben Bogin and Khyathi Raghavi Chandu and Jennifer Dumas and Yanai Elazar and Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aaka...

work page doi:10.18653/v1/2024.acl-long.840 2024
[51]

Aarohi Srivastava and others , title =. Trans. Mach. Learn. Res. , volume =. 2023 , url =

2023

[1] [1]

Extracting Training Data from Large Language Models , booktitle =

Nicholas Carlini and Florian Tram. Extracting Training Data from Large Language Models , booktitle =. 2021 , url =

2021

[2] [2]

Quantifying Memorization Across Neural Language Models , booktitle =

Nicholas Carlini and Daphne Ippolito and Matthew Jagielski and Katherine Lee and Florian Tram. Quantifying Memorization Across Neural Language Models , booktitle =. 2023 , url =

2023

[3] [3]

CoRR , volume =

Valentin Hartmann and Anshuman Suri and Vincent Bindschaedler and David Evans and Shruti Tople and Robert West , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2310.18362 , eprinttype =. 2310.18362 , timestamp =

work page doi:10.48550/arxiv.2310.18362 2023

[4] [4]

Adversarial Attacks and Defenses in Large Language Models: Old and New Threats , booktitle =

Leo Schwinn and David Dobre and Stephan G. Adversarial Attacks and Defenses in Large Language Models: Old and New Threats , booktitle =. 2023 , url =

2023

[5] [5]

Lee, K., Ippolito, D., Nystrom, A., Zhang, C., Eck, D., Callison-Burch, C., and Carlini, N

Katherine Lee and Daphne Ippolito and Andrew Nystrom and Chiyuan Zhang and Douglas Eck and Chris Callison. Deduplicating Training Data Makes Language Models Better , booktitle =. 2022 , url =. doi:10.18653/V1/2022.ACL-LONG.577 , timestamp =

work page doi:10.18653/v1/2022.acl-long.577 2022

[6] [6]

Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs , booktitle =

Abhimanyu Hans and John Kirchenbauer and Yuxin Wen and Neel Jain and Hamid Kazemi and Prajwal Singhania and Siddharth Singh and Gowthami Somepalli and Jonas Geiping and Abhinav Bhatele and Tom Goldstein , editor =. Be like a Goldfish, Don't Memorize! Mitigating Memorization in Generative LLMs , booktitle =. 2024 , url =

2024

[7] [7]

CoRR , volume =

Alexander Xiong and Xuandong Zhao and Aneesh Pappu and Dawn Song , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2507.05578 , eprinttype =. 2507.05578 , timestamp =

work page doi:10.48550/arxiv.2507.05578 2025

[8] [8]

Knowledge Neurons in Pretrained Transformers

Damai Dai and Li Dong and Yaru Hao and Zhifang Sui and Baobao Chang and Furu Wei , editor =. Knowledge Neurons in Pretrained Transformers , booktitle =. 2022 , url =. doi:10.18653/V1/2022.ACL-LONG.581 , timestamp =

work page doi:10.18653/v1/2022.acl-long.581 2022

[9] [9]

Locating and Editing Factual Associations in

Kevin Meng and David Bau and Alex Andonian and Yonatan Belinkov , editor =. Locating and Editing Factual Associations in. Advances in Neural Information Processing Systems 35: Annual Conference on Neural Information Processing Systems 2022, NeurIPS 2022, New Orleans, LA, USA, November 28 - December 9, 2022 , year =

2022

[10] [10]

Andonian and Yonatan Belinkov and David Bau , title =

Kevin Meng and Arnab Sen Sharma and Alex J. Andonian and Yonatan Belinkov and David Bau , title =. The Eleventh International Conference on Learning Representations,. 2023 , url =

2023

[11] [11]

Transformer Feed-Forward Layers Are Key-Value Memories

Mor Geva and Roei Schuster and Jonathan Berant and Omer Levy , editor =. Transformer Feed-Forward Layers Are Key-Value Memories , booktitle =. 2021 , url =. doi:10.18653/V1/2021.EMNLP-MAIN.446 , timestamp =

work page internal anchor Pith review doi:10.18653/v1/2021.emnlp-main.446 2021

[12] [12]

Sparse Autoencoders Find Highly Interpretable Features in Language Models

Hoagy Cunningham and Aidan Ewart and Logan Riggs Smith and Robert Huben and Lee Sharkey , title =. CoRR , volume =. 2023 , url =. doi:10.48550/ARXIV.2309.08600 , eprinttype =. 2309.08600 , timestamp =

work page internal anchor Pith review Pith/arXiv arXiv doi:10.48550/arxiv.2309.08600 2023

[13] [13]

Information Flow Routes: Automatically Interpreting Language Models at Scale , booktitle =

Javier Ferrando and Elena Voita , editor =. Information Flow Routes: Automatically Interpreting Language Models at Scale , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.965 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.965 2024

[14] [14]

Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs

Peter Hase and Mohit Bansal and Been Kim and Asma Ghandeharioun , editor =. Does Localization Inform Editing? Surprising Differences in Causality-Based Localization vs. Knowledge Editing in Language Models , booktitle =. 2023 , url =

2023

[15] [15]

The Twelfth International Conference on Learning Representations,

Jingcheng Niu and Andrew Liu and Zining Zhu and Gerald Penn , title =. The Twelfth International Conference on Learning Representations,. 2024 , url =

2024

[16] [16]

The butterfly effect of model editing: Few edits can trigger large language models collapse

Wanli Yang and Fei Sun and Xinyu Ma and Xun Liu and Dawei Yin and Xueqi Cheng , editor =. The Butterfly Effect of Model Editing: Few Edits Can Trigger Large Language Models Collapse , booktitle =. 2024 , url =. doi:10.18653/V1/2024.FINDINGS-ACL.322 , timestamp =

work page doi:10.18653/v1/2024.findings-acl.322 2024

[17] [17]

Demystifying Verbatim Memorization in Large Language Models , booktitle =

Jing Huang and Diyi Yang and Christopher Potts , editor =. Demystifying Verbatim Memorization in Large Language Models , booktitle =. 2024 , url =. doi:10.18653/V1/2024.EMNLP-MAIN.598 , timestamp =

work page doi:10.18653/v1/2024.emnlp-main.598 2024

[18] [18]

CoRR , volume =

Ali Satvaty and Suzan Verberne and Fatih Turkmen , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2410.02650 , eprinttype =. 2410.02650 , timestamp =

work page doi:10.48550/arxiv.2410.02650 2024

[19] [19]

Deduplicating Training Data Mitigates Privacy Risks in Language Models , booktitle =

Nikhil Kandpal and Eric Wallace and Colin Raffel , editor =. Deduplicating Training Data Mitigates Privacy Risks in Language Models , booktitle =. 2022 , url =

2022

[20] [20]

Copyright Traps for Large Language Models , booktitle =

Matthieu Meeus and Igor Shilov and Manuel Faysse and Yves. Copyright Traps for Large Language Models , booktitle =. 2024 , url =

2024

[21] [21]

Understanding Transformer Memorization Recall Through Idioms

Adi Haviv and Ido Cohen and Jacob Gidron and Roei Schuster and Yoav Goldberg and Mor Geva , editor =. Understanding Transformer Memorization Recall Through Idioms , booktitle =. 2023 , url =. doi:10.18653/V1/2023.EACL-MAIN.19 , timestamp =

work page doi:10.18653/v1/2023.eacl-main.19 2023

[22] [22]

CoRR , volume =

Niklas Stoehr and Mitchell Gordon and Chiyuan Zhang and Owen Lewis , title =. CoRR , volume =. 2024 , url =. doi:10.48550/ARXIV.2403.19851 , eprinttype =. 2403.19851 , timestamp =

work page doi:10.48550/arxiv.2403.19851 2024

[23] [23]

CoRR , volume =

Ilya Lasy and Peter Knees and Stefan Woltran , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2506.21588 , eprinttype =. 2506.21588 , timestamp =

work page doi:10.48550/arxiv.2506.21588 2025

[24] [24]

Zico Kolter and Chiyuan Zhang , editor =

Pratyush Maini and Michael Curtis Mozer and Hanie Sedghi and Zachary Chase Lipton and J. Zico Kolter and Chiyuan Zhang , editor =. Can Neural Network Memorization Be Localized? , booktitle =. 2023 , url =

2023

[25] [25]

Do Localization Methods Actually Localize Memorized Data in LLM s? A Tale of Two Benchmarks

Ting. Do Localization Methods Actually Localize Memorized Data in. Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers),. 2024 , url =. doi:10.18653/V1/2024.NAACL-LONG.176 , timestamp =

work page doi:10.18653/v1/2024.naacl-long.176 2024

[26] [26]

CoRR , volume =

Manan Suri and Nishit Anand and Amisha Bhaskar , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2503.06040 , eprinttype =. 2503.06040 , timestamp =

work page doi:10.48550/arxiv.2503.06040 2025

[27] [27]

Hudson and Caleb Geniesse and Kyle Chard and Yaoqing Yang and Ian T

Mansi Sakarvadia and Aswathy Ajith and Arham Mushtaq Khan and Nathaniel C. Hudson and Caleb Geniesse and Kyle Chard and Yaoqing Yang and Ian T. Foster and Michael W. Mahoney , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025

[28] [28]

Fortieth

Zhenliang Zhang and Xinyu Hu and Xiaojun Wan , editor =. Fortieth. 2026 , url =. doi:10.1609/AAAI.V40I41.40793 , timestamp =

work page doi:10.1609/aaai.v40i41.40793 2026

[29] [29]

Dirk Groeneveld and Iz Beltagy and Evan Pete Walsh and Akshita Bhagia and Rodney Kinney and Oyvind Tafjord and Ananya Harsh Jha and Hamish Ivison and Ian Magnusson and Yizhong Wang and Shane Arora and David Atkinson and Russell Authur and Khyathi Raghavi Chandu and Arman Cohan and Jennifer Dumas and Yanai Elazar and Yuling Gu and Jack Hessel and Tushar Kh...

work page doi:10.18653/v1/2024.acl-long.841 2024

[30] [30]

Transformer Circuits Thread , volume=

A mathematical framework for transformer circuits , author=. Transformer Circuits Thread , volume=

[31] [31]

Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space

Mor Geva and Avi Caciularu and Kevin Ro Wang and Yoav Goldberg , editor =. Transformer Feed-Forward Layers Build Predictions by Promoting Concepts in the Vocabulary Space , booktitle =. 2022 , url =. doi:10.18653/V1/2022.EMNLP-MAIN.3 , timestamp =

work page doi:10.18653/v1/2022.emnlp-main.3 2022

[32] [32]

CommonGen:

Bill Yuchen Lin and Wangchunshu Zhou and Ming Shen and Pei Zhou and Chandra Bhagavatula and Yejin Choi and Xiang Ren , editor =. CommonGen:. Findings of the Association for Computational Linguistics:. 2020 , url =. doi:10.18653/V1/2020.FINDINGS-EMNLP.165 , timestamp =

work page doi:10.18653/v1/2020.findings-emnlp.165 2020

[33] [33]

Low-Perplexity

Arthur Wuhrmann and Andrei Kucharavy and Anastasiia Kucherenko , editor =. Low-Perplexity. Proceedings of the 63rd Annual Meeting of the Association for Computational Linguistics (Volume 4: Student Research Workshop),. 2025 , url =. doi:10.18653/V1/2025.ACL-SRW.51 , timestamp =

work page doi:10.18653/v1/2025.acl-srw.51 2025

[34] [34]

CoRR , volume =

Andr. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2510.25941 , eprinttype =. 2510.25941 , timestamp =

work page doi:10.48550/arxiv.2510.25941 2025

[35] [35]

Model editing at scale leads to gradual and catastrophic forgetting

Akshat Gupta and Anurag Rao and Gopala Anumanchipalli , editor =. Model Editing at Scale leads to Gradual and Catastrophic Forgetting , booktitle =. 2024 , url =. doi:10.18653/V1/2024.FINDINGS-ACL.902 , timestamp =

work page doi:10.18653/v1/2024.findings-acl.902 2024

[36] [36]

MicroEdit: Neuron-level Knowledge Disentanglement and Localization in Lifelong Model Editing , booktitle =

Shiqi Wang and Qi Wang and Runliang Niu and He Kong and Yi Chang , editor =. MicroEdit: Neuron-level Knowledge Disentanglement and Localization in Lifelong Model Editing , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.1719 , timestamp =

work page doi:10.18653/v1/2025.emnlp-main.1719 2025

[37] [37]

GeoEdit: Geometric Knowledge Editing for Large Language Models , booktitle =

Yujie Feng and Li. GeoEdit: Geometric Knowledge Editing for Large Language Models , booktitle =. 2025 , url =. doi:10.18653/V1/2025.EMNLP-MAIN.676 , timestamp =

work page doi:10.18653/v1/2025.emnlp-main.676 2025

[38] [38]

Alaa and Thomas Hartvigsen and Gopala Anumanchipalli , editor =

Akshat Gupta and Phudish Prateepamornkul and Maochuan Lu and Ahmed M. Alaa and Thomas Hartvigsen and Gopala Anumanchipalli , editor =. Lifelong Knowledge Editing requires Better Regularization , booktitle =. 2025 , url =

2025

[39] [39]

Wes Gurnee and Neel Nanda and Matthew Pauly and Katherine Harvey and Dmitrii Troitskii and Dimitris Bertsimas , title =. Trans. Mach. Learn. Res. , volume =. 2023 , url =

2023

[40] [40]

Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models , booktitle =

Zheyuan Liu and Guangyao Dou and Xiangchi Yuan and Chunhui Zhang and Zhaoxuan Tan and Meng Jiang , editor =. Modality-Aware Neuron Pruning for Unlearning in Multimodal Large Language Models , booktitle =. 2025 , url =

2025

[41] [41]

Wu and Virginia Smith , title =

Shengyuan Hu and Yiwei Fu and Steven Z. Wu and Virginia Smith , title =. The Thirteenth International Conference on Learning Representations,. 2025 , url =

2025

[42] [42]

Large Language Model Unlearning , booktitle =

Yuanshun Yao and Xiaojun Xu and Yang Liu , editor =. Large Language Model Unlearning , booktitle =. 2024 , url =

2024

[43] [43]

2022 , url =

Neel Nanda and Joseph Bloom , title =. 2022 , url =

2022

[44] [44]

HellaSwag: Can a machine really finish your sentence? InProceedings of the 57th Annual Meeting of the Association for Computational Linguistics, pages 4791–4800, 2019

Rowan Zellers and Ari Holtzman and Yonatan Bisk and Ali Farhadi and Yejin Choi , editor =. HellaSwag: Can a Machine Really Finish Your Sentence? , booktitle =. 2019 , url =. doi:10.18653/V1/P19-1472 , timestamp =

work page doi:10.18653/v1/p19-1472 2019

[45] [45]

MEMIT-Merge: Addressing

Zilu Dong and Xiangqing Shen and Rui Xia , editor =. MEMIT-Merge: Addressing. Findings of the Association for Computational Linguistics,. 2025 , url =

2025

[46] [46]

Interpreting

Nostalgebraist , journal=. Interpreting

[47] [47]

Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy , booktitle =

Daphne Ippolito and Florian Tram. Preventing Generation of Verbatim Memorization in Language Models Gives a False Sense of Privacy , booktitle =. 2023 , url =. doi:10.18653/V1/2023.INLG-MAIN.3 , timestamp =

work page doi:10.18653/v1/2023.inlg-main.3 2023

[48] [48]

Kassem and Omar Mahmoud and Sherif Saad , editor =

Aly M. Kassem and Omar Mahmoud and Sherif Saad , editor =. Preserving Privacy Through Dememorization: An Unlearning Technique For Mitigating Memorization Risks In Language Models , booktitle =. 2023 , url =. doi:10.18653/V1/2023.EMNLP-MAIN.265 , timestamp =

work page doi:10.18653/v1/2023.emnlp-main.265 2023

[49] [49]

CoRR , volume =

Tong Chen and Faeze Brahman and Jiacheng Liu and Niloofar Mireshghallah and Weijia Shi and Pang Wei Koh and Luke Zettlemoyer and Hannaneh Hajishirzi , title =. CoRR , volume =. 2025 , url =. doi:10.48550/ARXIV.2504.14452 , eprinttype =. 2504.14452 , timestamp =

work page doi:10.48550/arxiv.2504.14452 2025

[50] [50]

Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers) , pages =

Luca Soldaini and Rodney Kinney and Akshita Bhagia and Dustin Schwenk and David Atkinson and Russell Authur and Ben Bogin and Khyathi Raghavi Chandu and Jennifer Dumas and Yanai Elazar and Valentin Hofmann and Ananya Harsh Jha and Sachin Kumar and Li Lucy and Xinxi Lyu and Nathan Lambert and Ian Magnusson and Jacob Morrison and Niklas Muennighoff and Aaka...

work page doi:10.18653/v1/2024.acl-long.840 2024

[51] [51]

Aarohi Srivastava and others , title =. Trans. Mach. Learn. Res. , volume =. 2023 , url =

2023