ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Chengyi Yang; Jinsong Su; Yiping Song; Yujie Lin; Zhishang Xiang

arxiv: 2605.18879 · v3 · pith:OJWBZ6R2new · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CL

Pith reviewed 2026-05-21 08:02 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL

keywords: machine unlearning, large language models, model editing, knowledge removal, few-shot unlearning, representational orthogonality, privacy and safety

The pith

Large language models can unlearn specific sensitive knowledge by remapping inputs to neutral outputs through a closed-form multiplicative parameter update that enforces orthogonality in internal representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes machine unlearning as a targeted model editing task instead of retraining from scratch or aggressive fine-tuning. It proposes overwriting sensitive inputs by mapping them to a neutral target state while using a multiplicative update to strip away their original representations. A sympathetic reader would care because current approaches either demand heavy computation or risk damaging the model's performance on unrelated tasks. The method aims for efficient few-shot unlearning that keeps general capabilities intact. This addresses privacy and safety issues from models trained on broad web data without sacrificing utility.

Core claim

The central claim is that machine unlearning can be reformulated as a precise knowledge re-mapping problem via model editing. By mapping sensitive inputs to a neutral target state and removing their original representations through a multiplicative parameter update with a closed-form solution that enforces representational orthogonality, unlearning becomes efficient and targeted. This approach extends to a gradient-based variant for handling multiple samples and is shown to outperform baselines while preserving overall model utility.

What carries the argument

The argument is carried by a multiplicative parameter update with a closed-form solution that enforces representational orthogonality. The update performs the overwrite: it maps sensitive inputs to a neutral target state and removes their original representations.
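
One way to picture such an update is the following minimal sketch for a single linear layer. It is an illustration under assumptions, not the paper's closed-form solution: the left-multiplied projector `P`, the rank-one correction term, and all function names are hypothetical choices made here. Given a key `k` for a sensitive input, the original output `m_f = W k` is projected away, and the edited layer returns the component of a neutral target `m_n` that is orthogonal to `m_f`.

```python
# Illustrative sketch only: a projector-based edit of one linear layer.
# The update form and every name below are assumptions, not the paper's code.

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def matvec(M, v):
    return [dot(row, v) for row in M]

def matmul(A, B):
    cols = list(zip(*B))
    return [[dot(row, col) for col in cols] for row in A]

def projector_orthogonal_to(m):
    """Return P = I - m m^T / (m . m), which maps m to the zero vector."""
    norm_sq = dot(m, m)
    n = len(m)
    return [[(1.0 if i == j else 0.0) - m[i] * m[j] / norm_sq
             for j in range(n)] for i in range(n)]

def edit_layer(W, k, m_n):
    """Remap key k to the part of m_n that is orthogonal to the old output."""
    m_f = matvec(W, k)                  # original (sensitive) output
    P = projector_orthogonal_to(m_f)
    PW = matmul(P, W)                   # multiplicative step: P @ W
    target = matvec(P, m_n)             # neutral target with m_f removed
    kk = dot(k, k)
    # Rank-one correction so that W_new @ k equals the projected target.
    W_new = [[PW[i][j] + target[i] * k[j] / kk for j in range(len(k))]
             for i in range(len(PW))]
    return W_new, m_f

W = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
k = [1.0, 2.0]
m_n = [1.0, 0.0, 0.0]
W_new, m_f = edit_layer(W, k, m_n)
new_out = matvec(W_new, k)
print(abs(dot(new_out, m_f)) < 1e-9)  # checks orthogonality to the old output
```

Left-multiplying by `P` also removes the `m_f` direction from the outputs of unrelated keys, so a sketch like this says nothing about how well utility is preserved; that is an empirical question the paper's experiments would have to answer.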

If this is right

Sensitive inputs stop triggering the original harmful or private generations after the update.
Performance on general tasks and unrelated knowledge stays largely unchanged.
Unlearning works with only a few examples instead of full retraining or heavy fine-tuning.
The closed-form solution makes the process computationally efficient compared to iterative optimization methods.
The gradient-based extension allows handling multiple samples while maintaining the orthogonality property.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

The approach could reduce the need for periodic full retraining when new privacy regulations require removing specific data points.
It might combine with other editing techniques to handle sequential or conflicting unlearning requests over time.
Similar remapping ideas could apply to domains beyond language models, such as removing biases in vision systems.
Deployment in production systems would require verifying that the neutral target state does not introduce new unintended behaviors.

Load-bearing premise

Mapping sensitive inputs to a neutral target state combined with enforcing representational orthogonality via the multiplicative closed-form update will remove the original representations sufficiently to achieve unlearning without degrading unrelated knowledge or overall utility.

What would settle it

A direct test would check whether the updated model still generates the original sensitive content when given related prompts, or whether its accuracy on standard benchmarks for unrelated tasks falls measurably below the baseline model.
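
The two checks described above can be sketched as a small harness. This is a hypothetical outline: `generate` stands in for any callable that maps a prompt to generated text, substring matching is a crude proxy for "generates the sensitive content", and the 1-point accuracy tolerance is an arbitrary placeholder. None of these come from the paper, and the numbers in the example are toy values.

```python
# Hypothetical evaluation harness; names and thresholds are assumptions.

def leak_rate(generate, prompts, sensitive_answer):
    """Fraction of prompts whose generation still contains the answer."""
    if not prompts:
        raise ValueError("need at least one prompt")
    hits = sum(sensitive_answer.lower() in generate(p).lower() for p in prompts)
    return hits / len(prompts)

def utility_preserved(base_acc, edited_acc, tolerance=1.0):
    """True if every benchmark accuracy drops by at most `tolerance` points."""
    return all(base_acc[task] - edited_acc[task] <= tolerance
               for task in base_acc)

# Toy stand-in for an edited model: it refuses one phrasing but not another.
def toy_model(prompt):
    if prompt.startswith("Where was"):
        return "I don't know."
    return "The answer is Paris."

prompts = ["Where was X born?", "X's birthplace is"]
print(leak_rate(toy_model, prompts, "Paris"))             # 0.5
print(utility_preserved({"MMLU": 60.0}, {"MMLU": 59.5}))  # True
```

A real evaluation would likely need a stronger leakage criterion than substring matching, for example a judged comparison against the original answer.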

Figures

Figures reproduced from arXiv: 2605.18879 by Chengyi Yang, Jinsong Su, Yiping Song, Yujie Lin, Zhishang Xiang.

**Figure 1.** Geometric illustration of ZeroUnlearn. The original sensitive output m_f is first projected onto the null space via the projection matrix P (Step a). Subsequently, the optimization process aligns the projected representation with the target neutral state m_n (Step b) to achieve precise knowledge erasure.

**Figure 2.** Causal tracing for knowledge localization.

**Figure 3.** PCA visualization of MLP representation shifts at Layer 16 of Llama-3.2 on the MCF dataset.

**Figure 4.** Evaluation of general capabilities on Llama-3.2. Recovered plot labels: title "Downstream Task Evaluation"; y-axis "Accuracy"; tasks SST, MMLU, MRPC, COLA, RTE, NLI; methods Base, GA, FT, ROME, MEMIT, AlphaEdit, ZeroUnlearn.

**Figure 5.** Variation in the average indirect effect (AIE) for each token across all layers. We observe that for MLP outputs, the layers where the last subject token exhibits the peak AIE are often concentrated in the model's early (bottom) layers. However, our experiments reveal that editing these lower layers significantly compromises the model's general capabilities.

**Figure 6.** Average Indirect Effect of Attention modules across different architectures.

**Figure 7.** Layer-wise causal efficacy of hidden states (h_i^(l)). Recovered plot labels: panels (a) Llama-3.2-3B-Instruct and (b) Llama-3.1-8B-Instruct, each titled "Avg Indirect Effect of h_i^(l)"; x-axis "Single patched layer"; token positions: first subject token, middle subject tokens, last subject token, first subsequent token, further tokens, last token.

**Figure 8.** PCA visualization of MLP representation shifts at Layer 19 of Llama-3.1.

**Figure 9.** PCA visualization of MLP representation shifts at Layer 16 of Llama-3.2.

**Figure 10.** PCA visualization of MLP representation shifts at Layer 9 of Qwen3-4B.

original abstract

Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available on GitHub: https://github.com/XMUDeepLIT/ZeroUnlearn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes ZeroUnlearn, a few-shot unlearning framework for LLMs that reformulates the problem as knowledge re-mapping. Sensitive inputs are mapped to a neutral target state, and their original representations are removed via a multiplicative parameter update whose closed-form solution enforces representational orthogonality. A gradient-based extension handles multi-sample cases. Experiments are stated to demonstrate outperformance over baselines while preserving general model utility.

Significance. If the empirical claims hold, the approach would provide an efficient, low-cost alternative to retraining or aggressive fine-tuning for targeted unlearning, with direct relevance to privacy and safety applications. The closed-form multiplicative update and public code release are concrete strengths that aid reproducibility and verification.

major comments (1)

[Method description and experimental evaluation] The central unlearning guarantee rests on the claim that mapping a few-shot set of sensitive inputs to a neutral target and enforcing orthogonality via the closed-form update removes the underlying knowledge. Because LLM knowledge is distributed, this local edit on exact inputs may leave facts elicitable through paraphrases, contextual inference, or related queries never seen in the update; the manuscript should include targeted experiments measuring success against such indirect prompts to substantiate the claim.

minor comments (2)

[Abstract] The abstract asserts outperformance and utility preservation but supplies no concrete metrics, baselines, dataset sizes, or controls; these details must appear explicitly in the results section with tables or figures.
[Method] Notation for the neutral target state and the multiplicative update matrix should be defined once with consistent symbols across equations and text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive summary and for identifying a key area for strengthening our empirical claims. We address the major comment below and will revise the manuscript to incorporate the suggested evaluation.

point-by-point responses

Referee: The central unlearning guarantee rests on the claim that mapping a few-shot set of sensitive inputs to a neutral target and enforcing orthogonality via the closed-form update removes the underlying knowledge. Because LLM knowledge is distributed, this local edit on exact inputs may leave facts elicitable through paraphrases, contextual inference, or related queries never seen in the update; the manuscript should include targeted experiments measuring success against such indirect prompts to substantiate the claim.

Authors: We agree that the distributed nature of knowledge in LLMs means local edits on exact inputs may not fully address elicitation via paraphrases or related queries, and that targeted experiments on indirect prompts would strengthen the unlearning guarantee. Our current experiments evaluate direct removal on the provided sensitive inputs, showing effective mapping to neutral states, orthogonality, and outperformance over baselines with preserved utility. To address this point, we will add experiments in the revised manuscript that test the unlearned model on paraphrased versions of the sensitive inputs, contextual inference queries, and related but unseen queries, measuring whether the original knowledge remains inaccessible.

revision: yes
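
A per-category breakdown of the kind promised in this response might look like the sketch below. It is an assumption-laden illustration, not the authors' protocol: the category names mirror the rebuttal, while `generate`, the prompt sets, and substring matching as the leakage criterion are all hypothetical.

```python
# Hypothetical sketch: leakage by prompt category (not the authors' protocol).

def leakage_by_category(generate, prompt_sets, sensitive_answer):
    """Map each category name to the fraction of its prompts that leak."""
    report = {}
    for category, prompts in prompt_sets.items():
        leaks = [sensitive_answer.lower() in generate(p).lower()
                 for p in prompts]
        # An empty category is reported as None rather than as 0.0.
        report[category] = sum(leaks) / len(leaks) if leaks else None
    return report

# Toy model that was only edited on the exact (direct) prompt.
def toy_model(prompt):
    if prompt == "Q: capital of Z?":
        return "No information."
    return "It is Quux."

prompt_sets = {
    "direct": ["Q: capital of Z?"],
    "paraphrase": ["What city is Z's capital?", "Z is governed from where?"],
    "contextual": [],
}
report = leakage_by_category(toy_model, prompt_sets, "Quux")
print(report)
```

In this toy case the direct prompt no longer leaks while both paraphrases do, which is exactly the failure mode the referee's comment asks the experiments to rule out.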

Circularity Check

0 steps flagged

No significant circularity in ZeroUnlearn derivation chain

full rationale

The paper reformulates unlearning as a re-mapping problem and derives a closed-form multiplicative parameter update directly from the stated goal of enforcing representational orthogonality after mapping sensitive inputs to a neutral target state. This is an algebraic solution to an explicit objective rather than a self-referential definition, a fitted input renamed as prediction, or a result that reduces to its own inputs by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or smuggled ansatzes are indicated; the mechanism is presented as following from the edit equations without circular reduction. The overall claim remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The approach relies on standard model editing assumptions plus the specific choice of neutral target state and orthogonality enforcement; no new invented physical entities.

free parameters (1)

neutral target state
The specific neutral mapping target for sensitive inputs is introduced to overwrite original representations and is not derived from first principles.

axioms (1)

domain assumption: A multiplicative parameter update with closed-form solution can enforce representational orthogonality between original and new states.
This is the core mechanism invoked to achieve targeted removal of original representations.
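
As a hedged illustration of why a multiplicative update can enforce orthogonality, consider the simplest case of a rank-one orthogonal projector. The symbols $W$ (a layer's weight matrix), $k$ (the representation of a sensitive input), $m_f$ (its original output), and $P$ are assumptions chosen here, loosely following the Figure 1 caption; the paper's actual closed form may differ.

```latex
m_f = W k, \qquad
P = I - \frac{m_f m_f^{\top}}{m_f^{\top} m_f}, \qquad
W' = P W
\;\Longrightarrow\;
W' k = P m_f = m_f - \frac{m_f \,(m_f^{\top} m_f)}{m_f^{\top} m_f} = 0,
\qquad
m_f^{\top} (W' x) = \left(m_f^{\top} - m_f^{\top}\right) W x = 0
\quad \text{for all } x.
```

Under these assumptions, every output of the edited layer is orthogonal to the original sensitive output, which is the property the axiom takes for granted.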

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction

Relation between the paper passage and the cited Recognition theorem: unclear

Paper passage: "projecting sensitive inputs into a null space orthogonal to their original representations"

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.