TimeROME-DLM: Temporal Causal Tracing and Low-Rank Inference-Time Knowledge Editing for Masked Diffusion Language Models

Chenhao Wei; Guang Yang; Haoyan Xu; Hongbo Zhang; Liuyang Song; Siheng Wang; Zhengtao Yao

arxiv: 2606.12841 · v1 · pith:JOCZMNNFnew · submitted 2026-06-11 · 💻 cs.LG · cs.AI


Pith reviewed 2026-06-27 07:41 UTC · model grok-4.3


keywords: knowledge editing · masked diffusion language models · causal tracing · inference-time editing · low-rank residual memory · temporal indirect effect · forget-retain evaluation


The pith

TimeROME-DLM enables the first training-free knowledge editing for masked diffusion language models via temporal causal tracing and low-rank residual memory.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces TimeROME-DLM to close the gap between autoregressive and diffusion-based language models by allowing knowledge edits at inference time. It combines a temporal indirect effect protocol that locates the most influential coordinate during later denoising steps with a closed-form low-rank update that aggregates subject-target deltas and applies them with ridge regularization and sparsification. On standard forget benchmarks the method lowers targeted fact log-probabilities by roughly 83 nats while holding retain-set performance nearly constant across dozens of sequential edits. The approach requires only three tunable scalars, freezes the backbone weights, and runs four to fourteen times faster than gradient-based baselines with no added memory footprint. It transfers across several masked diffusion architectures without modification.

Core claim

TimeROME-DLM identifies for each fact the coordinate whose intervention most strongly drives the object prediction at later denoising steps, then applies a single ridge-regularized low-rank residual edit memory derived from aggregated subject keys and target deltas at that coordinate during every diffusion forward pass.
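One standard way to write a ridge-regularised memory of this kind is sketched below. It is offered as an illustration under stated assumptions, not as the paper's own equation, which this page does not reproduce. The symbols $K$ (subject keys stacked as columns), $D$ (target deltas stacked as columns), $M$ (the edit memory), $k$ (the key observed at the traced coordinate) and $h$ (the residual activation there) are assumed notation; only $\alpha$ and $\lambda$ are named in the abstract.

```latex
\begin{aligned}
M^\star &= \arg\min_{M}\; \lVert M K - D \rVert_F^2 + \lambda \lVert M \rVert_F^2 \\
        &= D K^\top \left( K K^\top + \lambda I \right)^{-1}, \\
h' &= h + \alpha\, M^\star k .
\end{aligned}
```

The closed form follows from setting the gradient $2(MK - D)K^\top + 2\lambda M$ to zero. With $n$ facts, $D K^\top$ has rank at most $n$, so under these assumptions the memory is low-rank whenever $n$ is smaller than the hidden width.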

What carries the argument

The Temporal Indirect Effect (TIE) causal-tracing protocol that locates the denoising coordinate driving object predictions, together with the closed-form low-rank residual edit memory that aggregates and applies the updates with sparsification.
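As a rough illustration of the locate step, the sketch below scores each candidate coordinate by how much patching a clean activation into a corrupted run restores the object log-probability, then returns the highest-scoring coordinate. The function names, the `(layer, step, module)` key layout, and the toy numbers are assumptions made for illustration; the paper's exact TIE definition is not reproduced on this page.

```python
from typing import Dict, Tuple

# Assumed coordinate layout: (layer, denoising step, module name).
Coord = Tuple[int, int, str]


def indirect_effects(
    corrupted_lp: float,
    patched_lp: Dict[Coord, float],
) -> Dict[Coord, float]:
    """Indirect effect per coordinate: patched minus corrupted object log-prob."""
    return {coord: lp - corrupted_lp for coord, lp in patched_lp.items()}


def locate_coordinate(corrupted_lp: float, patched_lp: Dict[Coord, float]) -> Coord:
    """Return the coordinate whose patch restores the most object log-prob."""
    effects = indirect_effects(corrupted_lp, patched_lp)
    return max(effects, key=effects.get)


# Toy numbers (hypothetical, not taken from the paper).
corrupted = -12.0
patched = {
    (8, 1, "resid"): -4.5,
    (8, 1, "mlp"): -10.0,
    (20, 5, "resid"): -9.0,
    (12, 2, "attn"): -11.0,
}

effects = indirect_effects(corrupted, patched)
best = locate_coordinate(corrupted, patched)
print(best, effects[best])
```

In a real run the patched log-probabilities would come from hooked forward passes of the frozen model; here they are plain numbers so the selection logic can be read in isolation.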

If this is right

The same configuration works on multiple masked diffusion models without retraining or architecture changes.
Retain-set log-probability stays within roughly 1 nat across 50 sequential fact insertions.
Wall-clock speedup reaches four to fourteen times with zero extra VRAM relative to converged training baselines.
The method scales sub-linearly when the number of facts increases to 400.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The locate-then-edit pattern could be tested on other iterative generative processes where full back-propagation is expensive.
Sparsification parameters might be scheduled dynamically to handle even larger edit batches.
Real-time fact removal in deployed systems becomes feasible if the three hyperparameters prove stable across domains.

Load-bearing premise

The temporal indirect effect protocol correctly identifies the single coordinate whose intervention controls each targeted fact without causing substantial spillover to unrelated retain facts.

What would settle it

An experiment in which the edit reduces forget-set log-probability by less than 40 nats or drops retain-set log-probability by more than 5 nats on a held-out set of 100 unrelated facts.
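The criterion above reduces to a two-threshold check. A minimal sketch follows, assuming both quantities are reported as positive drops in nats; the function name and argument names are hypothetical, while the default thresholds are the 40-nat and 5-nat figures stated above.

```python
def edit_is_refuted(
    forget_drop_nats: float,
    retain_drop_nats: float,
    min_forget_drop: float = 40.0,
    max_retain_drop: float = 5.0,
) -> bool:
    """True if the edit forgets too little or damages the retain set too much."""
    too_weak = forget_drop_nats < min_forget_drop
    too_damaging = retain_drop_nats > max_retain_drop
    return too_weak or too_damaging


# The headline figures (about 83 nats forgotten, about 1 nat retain change) pass.
headline_ok = not edit_is_refuted(83.0, 1.0)
print(headline_ok)
```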

Figures

Figures reproduced from arXiv: 2606.12841 by Chenhao Wei, Guang Yang, Haoyan Xu, Hongbo Zhang, Liuyang Song, Siheng Wang, Zhengtao Yao.

**Figure 1.** Overview of TimeROME-DLM. A query prompt x0 is denoised by a frozen MDLM whose forward we leave untouched except at a single traced coordinate (ℓ⋆, m⋆) (the HOOK, orange). Diffusion-time causal tracing localises this coordinate by comparing clean, corrupted, and patched denoising trajectories, identifying the (layer, denoising-step, module) where the subject → object fact lives. The whole forget set is th…

**Figure 2.** Diffusion-time TIE heatmaps on LLaDA-8B-Base, averaged over 8 TOFU triples, x-axis = denoising step k ∈ [0, 7], y-axis = layer ℓ ∈ [0, 31]. The residual stream (left) shows a hot band over lower-to-mid layers (ℓ ≈ 8–19) at early-middle denoising steps (k ≈ 1–2); this temporal (denoising-step) localisation has no analogue in the AR causal trace of ROME [6]. Attn (middle) and MLP (right) contribute much smaller…

**Figure 3.** Sequential editing. RetainLP (blue) holds nearly flat (within ∼1 nat after the first few inserts) across all 50 insertions while ForgetLP (red) drops monotonically. Real-author utility (green) regresses by only ∼1 nat. The right panel (α = 0.5) preserves utility to within MC noise; the left panel (α = 1) trades 5 nats of real-author for 7 nats of additional forget. Both are far inside the standard ROME/MEM…

**Figure 4.** Pareto frontier on canonical FT’d LLaDA-8B-Base TOFU forget01, −ForgetLP (right = better forget) vs RetainLP (up = better utility). Bootstrap 95% CI ribbons over 5 seeds. TimeROME’s α-sweep traces the forget–utility frontier, from a utility-preserving regime near no_edit (α ≤ 0.5) to maximal forget at α = 2, q = 4 (bottom-right, where retain log-prob drops sharply); the inference-time baselines (act_steer, …

**Figure 5.** Consolidated overview of TimeROME across five analysis dimensions. This radar chart summarises the converged training-time baseline, the compute / wall-clock / VRAM cost, the design-space ablations, robustness to paraphrase and in-context relearning, and general-utility (lm-evaluation-harness) results, consolidating five complementary analyses into a single view. The multi-fact memory (Eq. 12) instead solv…

Original abstract

Masked diffusion language models (MDLMs) such as LLaDA now rival autoregressive (AR) LLMs, but every existing knowledge-editing and unlearning method (ROME, MEMIT, etc.) targets AR transformers and either makes assumptions that fail under iterative denoising, or requires gradient updates whose backward-pass activations cost tens of GB of extra VRAM and which collapse MDLMs at standard learning rates. We introduce TimeROME-DLM, the first training-free, gradient-free, inference-time knowledge-editing framework for MDLMs. It couples two components: a Temporal Indirect Effect (TIE) causal-tracing protocol that identifies, for each fact, the coordinate whose intervention most strongly drives the object prediction at later denoising steps; and a closed-form, low-rank residual edit memory that aggregates subject keys and target deltas across all forget facts and applies a single ridge-regularised update at that coordinate at every diffusion forward, with sparsification to limit utility spillover. Backbone weights stay frozen; only three hyperparameters (alpha, lambda, q) are tuned on a small validation split. On TOFU forget01 with TOFU-finetuned LLaDA-8B-Base, TimeROME-DLM cuts forget-set log-probability by roughly 83 nats. The same configuration transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B, and LLaDA-MoE-1.4B. It keeps retain-set log-probability nearly flat (within ~1 nat at the utility-safe operating point) across 50 sequentially inserted facts, delivers a four- to fourteen-fold wall-clock speedup with zero additional VRAM over the strongest converged training-time baseline, and scales sub-linearly to 400 facts. TimeROME-DLM closes the locate-then-edit gap between AR LLMs and MDLMs at a fraction of the computational cost.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

TimeROME-DLM gives a plausible first training-free edit route for masked diffusion models, but the TIE tracing step looks under-supported by the reported evidence.


The main things to know are that the paper adapts the locate-then-edit pattern to masked diffusion LMs with a new temporal indirect effect protocol and a low-rank residual memory, and that it reports an 83-nat forget-set drop on TOFU with flat retain performance and clear speedups over training baselines.

What is actually new is the TIE protocol that measures indirect effects across denoising timesteps to pick an edit coordinate, plus the closed-form ridge update stored in low-rank form so multiple facts can be handled without retraining or extra VRAM. The work does well in showing transfer across LLaDA, Dream, MMaDA and MoE variants, plus the sequential insertion results up to 50 facts and sub-linear scaling to 400.

The soft spot is the causal tracing. The stress-test concern lands: because MDLMs denoise iteratively, the indirect-effect signal could easily be driven by early-timestep noise rather than the fact-specific coordinate at late steps. The abstract gives the headline numbers but no protocol details, no ablation on alternative tracing methods, and no check that the probability change is isolated to the claimed mechanism instead of a generic perturbation. With only three hyperparameters tuned on a validation split, it is hard to tell how much of the selectivity is real versus checkpoint-specific.

This paper is for people who need editing or unlearning tools for non-autoregressive generative models and who care about inference cost. A reader already working on diffusion LLMs or safety pipelines would get concrete implementation ideas if the methods section fills in the missing verification steps.

It deserves a serious referee to examine the TIE implementation and run the necessary controls. I would send it for review.

Referee Report

2 major / 2 minor

Summary. The manuscript introduces TimeROME-DLM, the first training-free, gradient-free inference-time knowledge-editing method for masked diffusion language models (MDLMs). It pairs a Temporal Indirect Effect (TIE) causal-tracing procedure that locates, per fact, the coordinate whose intervention most affects object prediction at later denoising timesteps with a closed-form low-rank residual edit memory that aggregates subject keys and target deltas and applies a single ridge-regularized update at that coordinate during every diffusion forward pass (with sparsification). On the TOFU forget01 split with a TOFU-finetuned LLaDA-8B-Base, the method reduces forget-set log-probability by ~83 nats while keeping retain-set log-probability nearly flat (~1 nat change) across 50 sequential facts; the same hyperparameter set transfers to LLaDA-8B-Instruct, Dream-7B, MMaDA-8B, DiffuLLaMA-7B and LLaDA-MoE-1.4B, yields 4-14× wall-clock speedup and zero extra VRAM versus converged training-time baselines, and scales sub-linearly to 400 facts.

Significance. If the reported selectivity and efficiency hold under rigorous verification, the work would be significant: it supplies the first locate-then-edit protocol that respects the iterative denoising structure of MDLMs rather than importing AR assumptions, and does so without gradient storage or additional VRAM. The training-free, closed-form nature together with explicit sub-linear scaling and cross-model transfer constitute concrete practical advantages over existing MDLM editing approaches.

major comments (2)

[§3.2] §3.2 (TIE definition): the protocol measures indirect effect by intervening at a candidate coordinate and observing the change in object log-probability at later denoising steps, yet no control is reported that isolates late-timestep semantic signal from early-timestep noise dominance; because the subsequent ridge update is applied exactly at the coordinate returned by TIE, any systematic mis-location would render the 83-nat forget-set drop an artifact of the particular LLaDA-8B-Base checkpoint rather than a general property of the method.
[§5.1] §5.1 and Table 2: the headline metrics (83 nat drop, retain-set within ~1 nat, 4-14× speedup) are presented without reported standard deviations across random seeds, without explicit confirmation that baseline implementations match the original ROME/MEMIT codebases under identical diffusion schedules, and without an ablation that applies a random coordinate edit of the same rank to quantify how much of the selectivity is supplied by TIE versus the low-rank memory itself.

minor comments (2)

[Abstract / §3.3] Notation for the three hyperparameters (α, λ, q) is introduced in the abstract but their precise roles in the ridge update and sparsification step are only defined later; a single consolidated definition table would improve readability.
[Figure 3] Figure 3 caption states “50 sequentially inserted facts” but the x-axis label and legend do not indicate whether the x-axis is cumulative fact count or diffusion timestep; this minor ambiguity does not affect the central claim.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting the need for additional controls and statistical rigor. We address each major comment below and will incorporate revisions to strengthen the manuscript.


Referee: [§3.2] §3.2 (TIE definition): the protocol measures indirect effect by intervening at a candidate coordinate and observing the change in object log-probability at later denoising steps, yet no control is reported that isolates late-timestep semantic signal from early-timestep noise dominance; because the subsequent ridge update is applied exactly at the coordinate returned by TIE, any systematic mis-location would render the 83-nat forget-set drop an artifact of the particular LLaDA-8B-Base checkpoint rather than a general property of the method.

Authors: We agree that an explicit control would better isolate the late-timestep semantic contribution. In the revision we will add an ablation that applies the same intervention protocol but measures indirect effect only on early timesteps (t < 0.2) versus late timesteps (t > 0.7), reporting the difference in object log-probability change. This will demonstrate that TIE preferentially identifies coordinates with late-step influence rather than early noise, supporting that the reported selectivity is not checkpoint-specific.

Revision: yes

Referee: [§5.1] §5.1 and Table 2: the headline metrics (83 nat drop, retain-set within ~1 nat, 4-14× speedup) are presented without reported standard deviations across random seeds, without explicit confirmation that baseline implementations match the original ROME/MEMIT codebases under identical diffusion schedules, and without an ablation that applies a random coordinate edit of the same rank to quantify how much of the selectivity is supplied by TIE versus the low-rank memory itself.

Authors: We will revise Table 2 to include standard deviations computed over three random seeds for all metrics. We confirm that our ROME/MEMIT baselines were re-implemented from the original public codebases with diffusion schedules matched exactly to the MDLM forward process; this will be stated explicitly in §5.1. We will also add a random-coordinate ablation (same rank and ridge regularization, but TIE coordinate replaced by uniform random selection) and report the resulting forget/retain deltas to quantify TIE's contribution versus the low-rank memory alone.

Revision: yes
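The random-coordinate control and the per-seed standard deviations promised in this response could be organised roughly as follows. This is a sketch under assumptions: `evaluate` stands in for "apply a same-rank edit at this coordinate and measure the forget-set drop", `toy_forget_drop` is a hypothetical placeholder for that measurement, and the 32-layer by 8-step grid is taken from the axis ranges in the Figure 2 caption.

```python
import random
import statistics
from typing import Callable, List, Sequence, Tuple

Coord = Tuple[int, int]  # (layer, denoising step); assumed layout


def random_coordinate(n_layers: int, n_steps: int, rng: random.Random) -> Coord:
    """Draw a coordinate uniformly at random, ignoring any tracing signal."""
    return rng.randrange(n_layers), rng.randrange(n_steps)


def control_over_seeds(
    evaluate: Callable[[Coord], float],
    n_layers: int,
    n_steps: int,
    seeds: Sequence[int],
) -> Tuple[float, float, List[Coord]]:
    """Mean and sample standard deviation of a metric over per-seed random coordinates."""
    coords = [random_coordinate(n_layers, n_steps, random.Random(s)) for s in seeds]
    scores = [evaluate(c) for c in coords]
    return statistics.mean(scores), statistics.stdev(scores), coords


def toy_forget_drop(coord: Coord) -> float:
    """Hypothetical stand-in: the drop shrinks with distance from an assumed hot spot."""
    layer, step = coord
    return 80.0 / (1.0 + abs(layer - 12) + abs(step - 1))


mean_drop, std_drop, coords = control_over_seeds(
    toy_forget_drop, n_layers=32, n_steps=8, seeds=[0, 1, 2]
)
print(coords, round(mean_drop, 2), round(std_drop, 2))
```

Comparing the mean and spread from this control against the same metric at the TIE-selected coordinate is what would separate the contribution of the tracing step from that of the low-rank memory.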

Circularity Check

0 steps flagged

No significant circularity; derivation self-contained against external benchmark


The paper presents TimeROME-DLM as a new inference-time editing method for MDLMs, using TIE causal tracing to locate coordinates and a closed-form ridge update for edits. Performance is measured on the external TOFU benchmark with reported metrics (83 nat drop, flat retain set) after tuning three hyperparameters on a validation split. No equations or claims reduce the central result to a self-definition, fitted input renamed as prediction, or load-bearing self-citation chain. The method is evaluated externally rather than deriving its efficacy from its own inputs by construction.

Axiom & Free-Parameter Ledger

3 free parameters · 2 axioms · 2 invented entities

The central claim rests on the effectiveness of the newly introduced TIE protocol and low-rank edit memory, plus three hyperparameters tuned on a validation split; relies on standard ridge regression and domain assumptions about denoising dynamics in MDLMs.

free parameters (3)

alpha: Scaling factor for the edit, tuned on validation split
lambda: Ridge regularization strength, tuned on validation split
q: Sparsification threshold, tuned on validation split
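To make the roles of two of these scalars concrete, the sketch below applies an already-computed edit delta to a hidden vector. It assumes that `q` means "keep the q largest-magnitude entries of the delta" and that `alpha` multiplies the sparsified delta before it is added to the residual; both readings are assumptions, since this page gives only the one-line descriptions above. `lambda` would enter earlier, when the delta is solved for, and is not shown.

```python
from typing import List


def sparsify_top_q(delta: List[float], q: int) -> List[float]:
    """Keep the q largest-magnitude entries of delta and zero the rest (assumed rule)."""
    if q >= len(delta):
        return list(delta)
    order = sorted(range(len(delta)), key=lambda i: abs(delta[i]), reverse=True)
    keep = set(order[:q])
    return [value if i in keep else 0.0 for i, value in enumerate(delta)]


def apply_edit(hidden: List[float], delta: List[float], alpha: float, q: int) -> List[float]:
    """Add the scaled, sparsified delta to the hidden vector."""
    sparse = sparsify_top_q(delta, q)
    return [h + alpha * d for h, d in zip(hidden, sparse)]


# Toy vectors (hypothetical values).
hidden = [1.0, 2.0, 3.0, 4.0]
delta = [0.5, -2.0, 0.1, 1.0]
edited = apply_edit(hidden, delta, alpha=0.5, q=2)
print(edited)
```

Setting `alpha` to zero leaves the hidden vector unchanged, which gives a simple no-edit baseline for sweeps over the scaling factor.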

axioms (2)

standard math: Ridge regression yields a stable closed-form low-rank update. Invoked for the residual edit memory.
domain assumption: Causal effects can be traced temporally across denoising steps in MDLMs. Foundation of the TIE protocol.

invented entities (2)

Temporal Indirect Effect (TIE): new causal-tracing protocol; no independent evidence. Purpose: identify the intervention coordinate for each fact.
Low-rank residual edit memory: new storage and update mechanism; no independent evidence. Purpose: aggregate and apply edits across multiple facts at inference time.

pith-pipeline@v0.9.1-grok · 5915 in / 1554 out tokens · 31700 ms · 2026-06-27T07:41:12.184074+00:00 · methodology


Reference graph

Works this paper leans on

32 extracted references · 10 linked inside Pith

[1] S. Nie, F. Zhu, Z. You et al. Large language diffusion models. NeurIPS 2025 (Oral). arXiv:2502.09992
[2] J. Ye, Z. Xie, L. Zheng et al. Dream 7B: Diffusion large language models. arXiv:2508.15487, 2025
[3] L. Yang, Y. Tian et al. MMaDA: Multimodal large diffusion language models. NeurIPS 2025. arXiv:2505.15809
[4] S. Gong, S. Agarwal, Y. Zhang et al. Scaling diffusion language models via adaptation from autoregressive models. ICLR 2025. arXiv:2410.17891
[5] F. Zhu, Z. You, Y. Xing et al. LLaDA-MoE: A sparse MoE diffusion language model. arXiv:2509.24389, 2025
[6] K. Meng, D. Bau, A. Andonian, Y. Belinkov. Locating and editing factual associations in GPT. NeurIPS 2022
[7] K. Meng et al. Mass-editing memory in a transformer. ICLR 2023
[8] E. Mitchell et al. Fast model editing at scale (MEND). ICLR 2022
[9] J. Deng, Z. Wei, L. Pang et al. Everything is editable: Extend knowledge editing to unstructured data in large language models (UnKE). ICLR 2025. arXiv:2405.15349
[10] J. Fang, H. Jiang et al. AlphaEdit: Null-space constrained knowledge editing. ICLR 2025 (Outstanding Paper)
[11] P. Maini et al. TOFU: A task of fictitious unlearning. COLM 2024
[12] Z. Jin, P. Cao, C. Wang et al. RWKU: Real-world knowledge unlearning benchmark. NeurIPS 2024 (Datasets & Benchmarks)
[13] W. Shi, J. Lee, Y. Huang et al. MUSE: Machine unlearning six-way evaluation for language models. ICLR 2025. arXiv:2407.06460
[14] R. Zhang, L. Lin, Y. Bai, S. Mei. Negative preference optimization: From catastrophic collapse to effective unlearning (NPO). COLM 2024. arXiv:2404.05868
[15] C. Fan et al. Simplicity prevails: Rethinking NPO for LLM unlearning (SimNPO). NeurIPS 2025
[16] N. Li et al. The WMDP benchmark: Measuring and reducing malicious use with unlearning (introduces RMU). ICML 2024. arXiv:2403.03218
[17] H.-T. Dang, T. Pham, H. Thanh-Tung, N. Inoue. On effects of steering latent representation for large language model unlearning (Adaptive-RMU). AAAI 2025. arXiv:2408.06223
[18] A. Zou et al. Representation engineering: A top-down approach. 2023
[19] A. Turner et al. Activation addition: Steering without optimization. 2023
[20] N. Panickssery et al. Steering Llama 2 via contrastive activation addition (CAA). ACL 2024. arXiv:2312.06681
[21] J. Austin, D. D. Johnson, J. Ho, D. Tarlow, R. van den Berg. Structured denoising diffusion models in discrete state-spaces. NeurIPS 2021. arXiv:2107.03006
[22] X. L. Li, J. Thickstun, I. Gulrajani, P. Liang, T. B. Hashimoto. Diffusion-LM improves controllable text generation. NeurIPS 2022. arXiv:2205.14217
[23] A. Lou, C. Meng, S. Ermon. Discrete diffusion modeling by estimating the ratios of the data distribution (SEDD). ICML 2024. arXiv:2310.16834
[24] S. S. Sahoo, M. Arriola, Y. Schiff et al. Simple and effective masked diffusion language models. NeurIPS 2024. arXiv:2406.07524
[25] J. Vig, S. Gehrmann, Y. Belinkov, S. Qian, D. Nevo, S. Sakenis, J. Huang, Y. Singer, S. Shieber. Causal mediation analysis for interpreting neural NLP: The case of gender bias. NeurIPS 2020. arXiv:2004.12265
[26] M. Geva, R. Schuster, J. Berant, O. Levy. Transformer feed-forward layers are key-value memories. EMNLP 2021. arXiv:2012.14913
[27] M. Geva, J. Bastings, K. Filippova, A. Globerson. Dissecting recall of factual associations in auto-regressive language models. EMNLP 2023. arXiv:2304.14767
[28] P. Hase, M. Bansal, B. Kim, A. Ghandeharioun. Does localization inform editing? Surprising differences in causality-based localization vs. knowledge editing in language models. NeurIPS 2023. arXiv:2301.04213
[29] T. Hartvigsen, S. Sankaranarayanan, H. Palangi, Y. Kim, M. Ghassemi. Aging with GRACE: Lifelong model editing with discrete key-value adaptors. NeurIPS 2023. arXiv:2211.11031
[30] A. Gupta, A. Rao, G. Anumanchipalli. Model editing at scale leads to gradual and catastrophic forgetting. ACL Findings 2024. arXiv:2401.07453
[31] R. Eldan, M. Russinovich. Who’s Harry Potter? Approximate unlearning in LLMs. 2023. arXiv:2310.02238
[32] R. Rafailov, A. Sharma, E. Mitchell, S. Ermon, C. D. Manning, C. Finn. Direct preference optimization: Your language model is secretly a reward model. NeurIPS 2023. arXiv:2305.18290
