pith. machine review for the scientific record.

arxiv: 2202.05262 · v5 · submitted 2022-02-10 · 💻 cs.CL · cs.LG

Recognition: 2 theorem links · Lean Theorem

Locating and Editing Factual Associations in GPT

Authors on Pith: no claims yet

Pith reviewed 2026-05-16 09:55 UTC · model grok-4.3

classification 💻 cs.CL cs.LG
keywords model editing · factual associations · transformer language models · causal intervention · feed-forward modules · ROME · zero-shot relation extraction · counterfactual evaluation

The pith

Factual associations in GPT models are stored in localized mid-layer feed-forward computations that can be directly edited via rank-one weight updates.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that factual recall in autoregressive transformers occurs through a distinct set of steps in middle-layer feed-forward modules while processing subject tokens. A causal intervention identifies the neuron activations decisive for specific predictions, revealing these modules as the primary storage site. The authors then introduce Rank-One Model Editing to modify the relevant weights and update targeted associations. This method performs comparably to existing techniques on standard zero-shot relation extraction tasks and outperforms them on a new counterfactual dataset by preserving both specificity and generalization.

Core claim

Factual associations correspond to localized, directly-editable computations in mid-layer feed-forward modules. Causal tracing isolates the decisive activations during subject token processing, and a rank-one update to those feed-forward weights successfully revises specific associations while leaving broader model behavior intact.

What carries the argument

Rank-One Model Editing (ROME), a rank-one weight update applied to feed-forward modules identified by causal intervention as mediating factual recall.
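
The update itself is compact enough to sketch. Below is a minimal numpy version of a rank-one feed-forward edit, assuming an identity key covariance; the paper's actual ROME update additionally weights the key by a covariance statistic estimated from text, so treat this as the shape of the idea rather than the published algorithm.

```python
import numpy as np

# Sketch of a rank-one edit to a feed-forward projection matrix W, in the
# spirit of ROME. Simplified: identity key covariance assumed for brevity.

rng = np.random.default_rng(0)
d_in, d_out = 64, 48
W = rng.normal(size=(d_out, d_in))          # original FFN weights

k_star = rng.normal(size=d_in)              # key: subject representation
v_star = rng.normal(size=d_out)             # value: desired new association

# Choose the rank-one update Delta = u k*^T so that W' k* = v*.
u = (v_star - W @ k_star) / (k_star @ k_star)
W_edited = W + np.outer(u, k_star)

assert np.allclose(W_edited @ k_star, v_star)   # targeted association changed

# Directions orthogonal to k* are untouched, which is the mechanism behind
# ROME's specificity: W' x = W x whenever x is orthogonal to k*.
x = rng.normal(size=d_in)
x_perp = x - (x @ k_star) / (k_star @ k_star) * k_star
assert np.allclose(W_edited @ x_perp, W @ x_perp)
```

The rank-one form is what makes the edit both targeted and cheap: only one outer product is added to one weight matrix.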

If this is right

  • Mid-layer feed-forward modules serve as a primary locus for storing and recalling factual associations during subject token processing.
  • Direct weight manipulation can achieve precise factual edits while maintaining performance on standard benchmarks.
  • ROME simultaneously achieves specificity and generalization on counterfactual assertions where other editing methods trade one off against the other.
  • The identified computations form a distinct sequence of steps that mediate factual predictions inside the transformer.
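
The specificity/generalization bullet can be made concrete. A toy sketch of CounterFact-style scoring, with illustrative probabilities rather than numbers from the paper: an edit succeeds on a prompt if the new object out-scores the old one, and specificity requires that unrelated neighborhood prompts still prefer the old object.

```python
import numpy as np

# Toy sketch of the counterfactual evaluation logic. All prompt
# probabilities below are illustrative stand-ins, not paper results.

def prefers_new(p_new, p_old):
    """An edit counts as successful on a prompt if the new object wins."""
    return p_new > p_old

# (p_new_object, p_old_object) under the edited model, per prompt:
edit_prompt  = (0.61, 0.22)                       # the edited statement itself
paraphrases  = [(0.48, 0.30), (0.55, 0.21)]       # should also prefer new
neighborhood = [(0.05, 0.71), (0.09, 0.64)]       # should still prefer old

efficacy       = prefers_new(*edit_prompt)
generalization = np.mean([prefers_new(*p) for p in paraphrases])
specificity    = np.mean([not prefers_new(*p) for p in neighborhood])

print(efficacy, generalization, specificity)   # True 1.0 1.0
```

The tradeoff the paper measures is between the last two quantities: methods that push `generalization` up by editing aggressively tend to drag `specificity` down, and vice versa.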

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • Similar localization and editing techniques could extend to correcting factual errors or biases without full retraining.
  • The approach implies that other forms of stored knowledge, such as procedural or linguistic patterns, might be isolatable through comparable causal tracing.
  • Precise editing opens questions about whether repeated interventions accumulate side effects over multiple facts.

Load-bearing premise

The causal intervention accurately isolates the decisive feed-forward computations for factual recall, and the rank-one update changes the targeted association without creating unmeasured side effects on the rest of the model's behavior.

What would settle it

An experiment in which applying the rank-one update to the identified feed-forward weights either fails to change the model's output on the targeted factual query or produces large unintended shifts in predictions on unrelated inputs.
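
The causal-tracing half of such an experiment follows a simple patch-and-measure loop. The sketch below runs it on a toy stack of tanh layers (not a transformer) purely to show the logic: corrupt the input, restore one layer's clean hidden state, and measure how much of the clean output score recovers.

```python
import numpy as np

# Toy sketch of causal tracing: noise the input ("subject"), then patch
# one layer's clean activation into the corrupted run and measure recovery.

rng = np.random.default_rng(1)
n_layers, d = 8, 32
layers = [rng.normal(scale=d ** -0.5, size=(d, d)) for _ in range(n_layers)]
readout = rng.normal(size=d)

def run(x, patch_layer=None, clean_states=None):
    h, states = x, []
    for i, W in enumerate(layers):
        h = np.tanh(W @ h) + 0.5 * x       # input keeps leaking in downstream
        if i == patch_layer:
            h = clean_states[i]            # restore the clean activation here
        states.append(h)
    return float(readout @ h), states

x_clean = rng.normal(size=d)
score_clean, clean_states = run(x_clean)
x_noised = x_clean + rng.normal(scale=3.0, size=d)   # corrupt the "subject"
score_noised, _ = run(x_noised)

# Indirect effect per layer: how far restoring that layer's clean state
# in the corrupted run moves the output back toward the clean score.
effects = []
for i in range(n_layers):
    score_patched, _ = run(x_noised, patch_layer=i, clean_states=clean_states)
    effects.append(abs(score_patched - score_clean))

# Patching the final layer recovers the clean score exactly in this toy.
assert np.isclose(effects[-1], 0.0)
```

In the paper the same loop runs per (layer, token) position over a real transformer, and the recovery profile peaks at mid-layer FFN states over subject tokens.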

Original abstract

We analyze the storage and recall of factual associations in autoregressive transformer language models, finding evidence that these associations correspond to localized, directly-editable computations. We first develop a causal intervention for identifying neuron activations that are decisive in a model's factual predictions. This reveals a distinct set of steps in middle-layer feed-forward modules that mediate factual predictions while processing subject tokens. To test our hypothesis that these computations correspond to factual association recall, we modify feed-forward weights to update specific factual associations using Rank-One Model Editing (ROME). We find that ROME is effective on a standard zero-shot relation extraction (zsRE) model-editing task, comparable to existing methods. To perform a more sensitive evaluation, we also evaluate ROME on a new dataset of counterfactual assertions, on which it simultaneously maintains both specificity and generalization, whereas other methods sacrifice one or another. Our results confirm an important role for mid-layer feed-forward modules in storing factual associations and suggest that direct manipulation of computational mechanisms may be a feasible approach for model editing. The code, dataset, visualizations, and an interactive demo notebook are available at https://rome.baulab.info/

Editorial analysis

A structured set of objections, weighed in public.

Referee report, simulated author's rebuttal, circularity audit, and an axiom and free-parameter ledger. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper claims that factual associations in autoregressive transformers are stored as localized computations in mid-layer feed-forward modules. It introduces causal tracing to identify decisive neuron activations during subject-token processing and Rank-One Model Editing (ROME) to perform targeted rank-one updates to the corresponding weight matrices. Experiments demonstrate that ROME matches existing methods on the zsRE editing benchmark and, on a new counterfactual dataset, simultaneously preserves specificity and generalization where baselines trade one off for the other. Public code, data, and visualizations are released to support the localization and editing results.

Significance. If the central empirical claims hold, the work is significant for providing both a mechanistic account of factual recall and a practical, low-cost editing procedure that avoids full retraining. The combination of causal interventions, a new evaluation dataset designed to probe the specificity-generalization tradeoff, and fully public artifacts strengthens the contribution to model interpretability and controllable editing in the NLP community.

major comments (2)
  1. [§4] §4 (counterfactual evaluation): Specificity is assessed only against a curated collection of relations and counterfactual prompts. No direct measurement of distributional side effects (e.g., KL divergence between original and edited model outputs on unrelated prompts) is reported. This gap is load-bearing for the claim that ROME maintains specificity without unmeasured side effects.
  2. [§3] §3 (causal tracing): The restoration of clean-run activations into noised runs isolates mid-layer FFN effects on next-token prediction, yet the method does not include controls that would distinguish factual-association signals from general subject-token processing. If the latter is also restored, the localization result and the subsequent choice of layers for ROME could be overstated.
minor comments (2)
  1. [Table 1] Table 1 and Figure 3: Axis labels and legend entries use inconsistent abbreviations for baseline methods; a single consistent naming convention would improve readability.
  2. [§5.3] §5.3: The discussion of limitations mentions only zsRE and counterfactual datasets; an explicit statement of the scope of models tested (e.g., GPT-2 vs. larger variants) would clarify generalizability.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive and detailed feedback. We address each major comment below and indicate where revisions will be made to strengthen the manuscript.

Point-by-point responses
  1. Referee: [§4] §4 (counterfactual evaluation): Specificity is assessed only against a curated collection of relations and counterfactual prompts. No direct measurement of distributional side effects (e.g., KL divergence between original and edited model outputs on unrelated prompts) is reported. This gap is load-bearing for the claim that ROME maintains specificity without unmeasured side effects.

    Authors: We agree that a direct measurement of distributional side effects would provide stronger evidence for the specificity claim. In the revised manuscript we will add an analysis computing the average KL divergence between the original and ROME-edited model outputs on a large held-out set of unrelated prompts drawn from the CounterFact corpus (approximately 10k prompts). This will be reported alongside the existing specificity metrics to quantify any unmeasured side effects. revision: yes

  2. Referee: [§3] §3 (causal tracing): The restoration of clean-run activations into noised runs isolates mid-layer FFN effects on next-token prediction, yet the method does not include controls that would distinguish factual-association signals from general subject-token processing. If the latter is also restored, the localization result and the subsequent choice of layers for ROME could be overstated.

    Authors: The causal tracing procedure is applied exclusively to factual queries and measures recovery of the correct object token logit after subject-token noising. While subject tokens participate in general processing, the intervention is designed to disrupt factual recall specifically, and the effect size is evaluated only on the factual prediction. We acknowledge that additional controls (e.g., applying the same tracing to non-factual subject-token tasks) would help isolate factual signals more cleanly. We will add a limitations paragraph discussing this point and note that the localization is consistent across hundreds of distinct factual relations, which would be unlikely if the signal were purely general subject processing. revision: partial
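
The distributional check proposed in response 1 reduces to an average KL divergence over held-out prompts. A minimal sketch with synthetic logits; the vocabulary size, prompt count, and perturbation scale are placeholders, not values from the paper or rebuttal.

```python
import numpy as np

# Sketch of the proposed side-effect metric: mean KL divergence between
# the original and edited models' next-token distributions on unrelated
# prompts. Logits here are synthetic stand-ins for real model outputs.

def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)

def mean_kl(logits_a, logits_b):
    """Average KL(p_a || p_b) over a batch of prompts."""
    p, q = softmax(logits_a), softmax(logits_b)
    return float((p * (np.log(p) - np.log(q))).sum(axis=-1).mean())

rng = np.random.default_rng(2)
orig = rng.normal(size=(200, 1000))          # 200 prompts, 1000-token vocab
edited = orig + rng.normal(scale=0.01, size=orig.shape)   # near-no-op edit

drift = mean_kl(orig, edited)
assert drift >= 0.0   # KL is non-negative; small drift = high specificity
```

A specificity claim then becomes falsifiable: an edit that leaves `drift` near zero on unrelated prompts has measurably small distributional side effects.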

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

Full rationale

The paper's central claims rest on empirical causal tracing experiments that measure activation effects on next-token predictions and on downstream editing success rates measured against external benchmarks (zsRE and a new counterfactual dataset). These outcomes are not defined in terms of the fitted parameters or rank-one updates themselves; the localization to mid-layer FFN modules and the specificity/generalization tradeoff are reported as measured quantities rather than tautological consequences of the method definition. No self-citation chains, uniqueness theorems, or ansatzes are invoked to force the result, and the derivation remains self-contained against the reported experimental protocol.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The analysis relies on the standard transformer architecture and the assumption that factual recall can be isolated via causal interventions on activations. No new physical entities or ad-hoc constants are introduced beyond the definition of the ROME update rule itself.

axioms (1)
  • standard math Autoregressive transformers compute next-token predictions through stacked attention and feed-forward layers.
    Standard architectural assumption invoked throughout the causal tracing and editing sections.

pith-pipeline@v0.9.0 · 5497 in / 1285 out tokens · 43071 ms · 2026-05-16T09:55:25.774121+00:00 · methodology

discussion (0)


Forward citations

Cited by 19 Pith papers

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Slot Machines: How LLMs Keep Track of Multiple Entities

    cs.CL 2026-04 unverdicted novelty 8.0

    LLM activations encode current and prior entities in orthogonal slots, but models only use the current slot for explicit factual retrieval despite prior-slot information being linearly decodable.

  2. Progress measures for grokking via mechanistic interpretability

    cs.LG 2023-01 accept novelty 8.0

    Grokking arises from gradual amplification of a Fourier-based circuit in the weights followed by removal of memorizing components.

  3. Interpretability in the Wild: a Circuit for Indirect Object Identification in GPT-2 small

    cs.LG 2022-11 conditional novelty 8.0

    GPT-2 small solves indirect object identification via a circuit of 26 attention heads organized into seven functional classes discovered through causal interventions.

  4. PermaFrost-Attack: Stealth Pretraining Seeding(SPS) for planting Logic Landmines During LLM Training

    cs.LG 2026-04 unverdicted novelty 7.0

    Stealth Pretraining Seeding plants persistent unsafe behaviors in LLMs via diffuse poisoned web content that activates on precise triggers and evades standard evaluation.

  5. DSCA: Dynamic Subspace Concept Alignment for Lifelong VLM Editing

    cs.CV 2026-04 unverdicted novelty 7.0

    DSCA turns concept isolation into an architectural property by dynamically creating orthogonal subspaces for non-interfering lifelong edits in vision-language models, sustaining over 95% success after 1000 sequential edits.

  6. Norm Anchors Make Model Edits Last

    cs.LG 2026-01 conditional novelty 7.0

    Norm-Anchor Scaling breaks the norm-feedback loop in sequential LLM editing by anchoring value vectors to original norms, improving long-run performance by 72.2% and extending the editing horizon over 4x.

  7. Steering Language Models With Activation Engineering

    cs.CL 2023-08 unverdicted novelty 7.0

    Activation Addition steers language models by adding contrastive activation vectors from prompt pairs to control high-level properties like sentiment and toxicity at inference time without training.

  8. Not Just RLHF: Why Alignment Alone Won't Fix Multi-Agent Sycophancy

    cs.LG 2026-05 unverdicted novelty 6.0

    Pretrained base models exhibit higher yield to peer disagreement than RLHF instruct variants, with the effect localized to mid-layer attention and mitigated by structured dissent rather than prompt defenses.

  9. Correcting Influence: Unboxing LLM Outputs with Orthogonal Latent Spaces

    cs.LG 2026-05 unverdicted novelty 6.0

    A latent mediation framework with sparse autoencoders enables non-additive token-level influence attribution in LLMs by learning orthogonal features and back-propagating attributions.

  10. Why Do LLMs Struggle in Strategic Play? Broken Links Between Observations, Beliefs, and Actions

    cs.CL 2026-04 unverdicted novelty 6.0

    LLMs encode accurate but brittle internal beliefs about latent game states and convert them poorly into actions, creating systematic gaps that explain strategic failures.

  11. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 conditional novelty 6.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering raise deep-conflict accura...

  12. Dissociating Decodability and Causal Use in Bracket-Sequence Transformers

    cs.CL 2026-04 unverdicted novelty 6.0

    In Dyck-language transformers, attention patterns causally use top-of-stack information while residual-stream depth and distance signals are decodable yet causally inert.

  13. Co-Located Tests, Better AI Code: How Test Syntax Structure Affects Foundation Model Code Generation

    cs.SE 2026-04 unverdicted novelty 6.0

    Co-locating tests with implementation code yields substantially higher preservation and correctness in foundation-model-generated programs than separated test syntax.

  14. State Transfer Reveals Reuse in Controlled Routing

    cs.AI 2026-04 unverdicted novelty 6.0

    Fixed-interface state transfer provides stronger evidence of internal reuse in controlled routing than prompt retraining success alone.

  15. Attention, May I Have Your Decision? Localizing Generative Choices in Diffusion Models

    cs.CV 2026-03 unverdicted novelty 6.0

    Implicit generative choices in diffusion models for ambiguous prompts are localized principally in self-attention layers, enabling a targeted ICM steering method that outperforms prior debiasing approaches.

  16. Sparse Feature Circuits: Discovering and Editing Interpretable Causal Graphs in Language Models

    cs.LG 2024-03 unverdicted novelty 6.0

    Sparse feature circuits are introduced as interpretable causal subnetworks in language models, supporting unsupervised discovery of thousands of circuits and a method called SHIFT to improve classifier generalization ...

  17. The Override Gap: A Magnitude Account of Knowledge Conflict Failure in Hypernetwork-Based Instant LLM Adaptation

    cs.LG 2026-04 unverdicted novelty 5.0

    Knowledge conflicts in hypernetwork LLM adaptation stem from constant adapter margins losing to frequency-dependent pretrained margins; selective layer boosting and conflict-aware triggering close the gap.

  18. Towards Scalable Lifelong Knowledge Editing with Selective Knowledge Suppression

    cs.AI 2026-04 unverdicted novelty 5.0

    LightEdit enables scalable lifelong knowledge editing in LLMs via selective knowledge retrieval and probability suppression during decoding, outperforming prior methods on ZSRE, Counterfact, and RIPE while reducing tr...

  19. MemOS: A Memory OS for AI System

    cs.CL 2025-07 unverdicted novelty 5.0

    MemOS introduces a unified memory management framework for LLMs using MemCubes to handle and evolve different memory types for improved controllability and evolvability.

Reference graph

Works this paper leans on

43 extracted references · 43 canonical work pages · cited by 18 Pith papers · 3 internal anchors

  1. [1]

    Fine-grained analysis of sentence embeddings using auxiliary prediction tasks

    Adi, Y., Kermany, E., Belinkov, Y., Lavi, O., and Goldberg, Y. Fine-grained analysis of sentence embeddings using auxiliary prediction tasks. In International Conference on Learning Representations (ICLR), April 2017

  2. [2]

    Anderson, J. A. A simple neural network generating an interactive memory. Mathematical Biosciences, 14(3–4): 197–220, 1972

  3. [3]

    Rewriting a deep generative model

    Bau, D., Liu, S., Wang, T., Zhu, J.-Y., and Torralba, A. Rewriting a deep generative model. In Proceedings of the European Conference on Computer Vision (ECCV), 2020

  4. [4]

    Probing Classifiers: Promises, Shortcomings, and Advances

    Belinkov, Y. Probing Classifiers: Promises, Shortcomings, and Advances. Computational Linguistics, pp. 1–13, November 2021. ISSN 0891-2017. doi:10.1162/coli_a_00422. URL https://doi.org/10.1162/coli_a_00422

  5. [5]

    Analysis methods in neural language processing: A survey

    Belinkov, Y. and Glass, J. Analysis methods in neural language processing: A survey. Transactions of the Association for Computational Linguistics, 7: 49–72, March 2019. doi:10.1162/tacl_a_00254. URL https://aclanthology.org/Q19-1004

  6. [6]

    Belinkov, Y., Durrani, N., Dalvi, F., Sajjad, H., and Glass, J. What do neural machine translation models learn about morphology? In Proceedings of the 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 861–872, Vancouver, Canada, July 2017. Association for Computational Linguistics. doi:10.18653/v1/P17-108...

  7. [7]

    Brown, T., Mann, B., Ryder, N., Subbiah, M., Kaplan, J. D., Dhariwal, P., Neelakantan, A., Shyam, P., Sastry, G., Askell, A., Agarwal, S., Herbert-Voss, A., Krueger, G., Henighan, T., Child, R., Ramesh, A., Ziegler, D., Wu, J., Winter, C., Hesse, C., Chen, M., Sigler, E., Litwin, M., Gray, S., Chess, B., Clark, J., Berner, C., McCandlish, S., Radford, A.,...

  8. [8]

    What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties

    Conneau, A., Kruszewski, G., Lample, G., Barrault, L., and Baroni, M. What you can cram into a single $&!#* vector: Probing sentence embeddings for linguistic properties. In Proceedings of the 56th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 2126–2136, Melbourne, Australia, July 2018. Association for...

  9. [9]

    Knowledge neurons in pretrained transformers

    Dai, D., Dong, L., Hao, Y., Sui, Z., Chang, B., and Wei, F. Knowledge neurons in pretrained transformers. In Proceedings of the 60th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pp. 8493–8502, 2022

  10. [10]

    Editing factual knowledge in language models

    De Cao, N., Aziz, W., and Titov, I. Editing factual knowledge in language models. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 6491–6506, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-main.522

  11. [11]

    BERT: Pre-training of deep bidirectional transformers for language understanding

    Devlin, J., Chang, M.-W., Lee, K., and Toutanova, K. BERT: Pre-training of deep bidirectional transformers for language understanding. In Proceedings of the 2019 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long and Short Papers), pp. 4171–4186, Minneapolis, Minnesot...

  12. [12]

    Measuring and Improving Consistency in Pretrained Language Models

    Elazar, Y., Kassner, N., Ravfogel, S., Ravichander, A., Hovy, E., Schütze, H., and Goldberg, Y. Measuring and Improving Consistency in Pretrained Language Models. Transactions of the Association for Computational Linguistics, 9: 1012–1031, September 2021a. ISSN 2307-387X. doi:10.1162/tacl_a_00410. URL https://doi.org/10.1162/tacl_a_00410

  13. [13]

    Amnesic probing: Behavioral explanation with amnesic counterfactuals

    Elazar, Y., Ravfogel, S., Jacovi, A., and Goldberg, Y. Amnesic probing: Behavioral explanation with amnesic counterfactuals. Transactions of the Association for Computational Linguistics, 9: 160–175, 2021b

  14. [14]

    A mathematical framework for transformer circuits

    Elhage, N., Nanda, N., Olsson, C., Henighan, T., Joseph, N., Mann, B., Askell, A., Bai, Y., Chen, A., Conerly, T., DasSarma, N., Drain, D., Ganguli, D., Hatfield-Dodds, Z., Hernandez, D., Jones, A., Kernion, J., Lovitt, L., Ndousse, K., Amodei, D., Brown, T., Clark, J., Kaplan, J., McCandlish, S., and Olah, C. A mathematical framework for transformer circ...

  15. [15]

    Probing for semantic evidence of composition by means of simple classification tasks

    Ettinger, A., Elgohary, A., and Resnik, P. Probing for semantic evidence of composition by means of simple classification tasks. In Proceedings of the 1st Workshop on Evaluating Vector-Space Representations for NLP, pp. 134–139, Berlin, Germany, August 2016. Association for Computational Linguistics. doi:10.18653/v1/W16-2524. URL https://aclanthology.o...

  16. [16]

    CausaLM: Causal model explanation through counterfactual language models

    Feder, A., Oved, N., Shalit, U., and Reichart, R. CausaLM: Causal model explanation through counterfactual language models. Computational Linguistics, 47(2): 333–386, 2021

  17. [17]

    Causal analysis of syntactic agreement mechanisms in neural language models

    Finlayson, M., Mueller, A., Gehrmann, S., Shieber, S., Linzen, T., and Belinkov, Y. Causal analysis of syntactic agreement mechanisms in neural language models. In Proceedings of the 59th Annual Meeting of the Association for Computational Linguistics and the 11th International Joint Conference on Natural Language Processing (Volume 1: Long Papers), pp. ...

  18. [18]

    Transformer feed-forward layers are key-value memories

    Geva, M., Schuster, R., Berant, J., and Levy, O. Transformer feed-forward layers are key-value memories. In Proceedings of the 2021 Conference on Empirical Methods in Natural Language Processing, pp. 5484–5495, Online and Punta Cana, Dominican Republic, November 2021. Association for Computational Linguistics. URL https://aclanthology.org/2021.emnlp-main.446

  19. [19]

    Do language models have beliefs? methods for detecting, updating, and visualizing model beliefs

    Hase, P., Diab, M., Celikyilmaz, A., Li, X., Kozareva, Z., Stoyanov, V., Bansal, M., and Iyer, S. Do language models have beliefs? methods for detecting, updating, and visualizing model beliefs. arXiv preprint arXiv:2111.13654, 2021

  20. [20]

    Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure

    Hupkes, D., Veldhoen, S., and Zuidema, W. Visualisation and 'diagnostic classifiers' reveal how recurrent and recursive neural networks process hierarchical structure. Journal of Artificial Intelligence Research, 61: 907–926, 2018

  21. [21]

    How can we know what language models know?

    Jiang, Z., Xu, F. F., Araki, J., and Neubig, G. How can we know what language models know? Transactions of the Association for Computational Linguistics, 8: 423–438, 2020. doi:10.1162/tacl_a_00324. URL https://aclanthology.org/2020.tacl-1.28

  22. [22]

    Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In Bengio, Y. and LeCun, Y. (eds.), 3rd International Conference on Learning Representations, ICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015. URL http://arxiv.org/abs/1412.6980

  23. [23]

    Correlation matrix memories

    Kohonen, T. Correlation matrix memories. IEEE Transactions on Computers, 100(4): 353–359, 1972

  24. [24]

    Zero-shot relation extraction via reading comprehension

    Levy, O., Seo, M., Choi, E., and Zettlemoyer, L. Zero-shot relation extraction via reading comprehension. In Proceedings of the 21st Conference on Computational Natural Language Learning (CoNLL 2017), pp. 333–342, Vancouver, Canada, August 2017. Association for Computational Linguistics. doi:10.18653/v1/K17-1034. URL https://aclanthology.org/K17-1034

  25. [25]

    BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension

    Lewis, M., Liu, Y., Goyal, N., Ghazvininejad, M., Mohamed, A., Levy, O., Stoyanov, V., and Zettlemoyer, L. BART: Denoising sequence-to-sequence pre-training for natural language generation, translation, and comprehension. In Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics, pp. 7871–7880, Online, July 2020. Associ...

  26. [26]

    Mass-Editing Memory in a Transformer

    Meng, K., Sen Sharma, A., Andonian, A., Belinkov, Y., and Bau, D. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022

  27. [27]

    Mitchell, E., Lin, C., Bosselut, A., Finn, C., and Manning, C. D. Fast model editing at scale. In International Conference on Learning Representations, 2021

  28. [28]

    Direct and indirect effects

    Pearl, J. Direct and indirect effects. In Proceedings of the Seventeenth Conference on Uncertainty in Artificial Intelligence, pp. 411–420, 2001

  29. [29]

    Causality: Models, Reasoning and Inference

    Pearl, J. Causality: Models, Reasoning and Inference. Cambridge University Press, USA, 2nd edition, 2009. ISBN 052189560X

  30. [30]

    Petroni, F., Rocktäschel, T., Riedel, S., Lewis, P., Bakhtin, A., Wu, Y., and Miller, A. Language models as knowledge bases? In Proceedings of the 2019 Conference on Empirical Methods in Natural Language Processing and the 9th International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), pp. 2463–2473, Hong Kong, China, November 2019...

  31. [31]

    How context affects language models' factual predictions

    Petroni, F., Lewis, P., Piktus, A., Rocktäschel, T., Wu, Y., Miller, A. H., and Riedel, S. How context affects language models' factual predictions. In Automated Knowledge Base Construction, 2020

  32. [32]

    Language models are unsupervised multitask learners

    Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., Sutskever, I., et al. Language models are unsupervised multitask learners. OpenAI blog, pp. 9, 2019

  33. [33]

    Raffel, C., Shazeer, N., Roberts, A., Lee, K., Narang, S., Matena, M., Zhou, Y., Li, W., and Liu, P. J. Exploring the limits of transfer learning with a unified text-to-text transformer. Journal of Machine Learning Research, 21(140): 1–67, 2020

  34. [34]

    Roberts, A., Raffel, C., and Shazeer, N. How much knowledge can you pack into the parameters of a language model? In Proceedings of the 2020 Conference on Empirical Methods in Natural Language Processing (EMNLP), pp. 5418–5426, Online, November 2020. Association for Computational Linguistics. doi:10.18653/v1/2020.emnlp-main.437. URL https://aclanthology...

  35. [35]

    Axiomatic attribution for deep networks

    Sundararajan, M., Taly, A., and Yan, Q. Axiomatic attribution for deep networks. In International Conference on Machine Learning, pp. 3319–3328. PMLR, 2017

  36. [36]

    Attention is all you need

    Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems, pp. 5998–6008, 2017

  37. [37]

    Causal mediation analysis for interpreting neural NLP: The case of gender bias

    Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Sakenis, S., Huang, J., Singer, Y., and Shieber, S. Causal mediation analysis for interpreting neural NLP: The case of gender bias. arXiv preprint arXiv:2004.12265, 2020a

  38. [38]

    Vig, J., Gehrmann, S., Belinkov, Y., Qian, S., Nevo, D., Singer, Y., and Shieber, S. M. Investigating gender bias in language models using causal mediation analysis. In NeurIPS, 2020b

  39. [39]

    GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model

    Wang, B. and Komatsuzaki, A. GPT-J-6B: A 6 Billion Parameter Autoregressive Language Model. https://github.com/kingoflolz/mesh-transformer-jax, May 2021

  40. [40]

    Zhang, Y., Galley, M., Gao, J., Gan, Z., Li, X., Brockett, C., and Dolan, W. B. Generating informative and diverse conversational responses via adversarial information maximization. In NeurIPS, 2018

  41. [41]

    Of non-linearity and commutativity in BERT

    Zhao, S., Pascual, D., Brunner, G., and Wattenhofer, R. Of non-linearity and commutativity in BERT. In 2021 International Joint Conference on Neural Networks (IJCNN), pp. 1–8. IEEE, 2021

  42. [42]

    Factual probing is [MASK]: Learning vs. learning to recall

    Zhong, Z., Friedman, D., and Chen, D. Factual probing is [MASK]: Learning vs. learning to recall. In Proceedings of the 2021 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, pp. 5017–5033, Online, June 2021. Association for Computational Linguistics. doi:10.18653/v1/2021.naacl-main...

  43. [43]

    Modifying memories in transformer models

    Zhu, C., Rawat, A. S., Zaheer, M., Bhojanapalli, S., Li, D., Yu, F., and Kumar, S. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363, 2020