
arxiv: 2605.18879 · v2 · pith:OJWBZ6R2new · submitted 2026-05-16 · 💻 cs.LG · cs.AI · cs.CL

ZeroUnlearn: Few-Shot Knowledge Unlearning in Large Language Models

Pith reviewed 2026-05-21 08:02 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CL
keywords machine unlearning · large language models · model editing · knowledge removal · few-shot unlearning · representational orthogonality · privacy and safety

The pith

Large language models can unlearn specific sensitive knowledge by remapping inputs to neutral outputs through a closed-form multiplicative parameter update that enforces orthogonality in internal representations.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper reframes machine unlearning as a targeted model editing task instead of retraining from scratch or aggressive fine-tuning. It proposes overwriting sensitive inputs by mapping them to a neutral target state while using a multiplicative update to strip away their original representations. A sympathetic reader would care because current approaches either demand heavy computation or risk damaging the model's performance on unrelated tasks. The method aims for efficient few-shot unlearning that keeps general capabilities intact. This addresses privacy and safety issues from models trained on broad web data without sacrificing utility.

Core claim

The central claim is that machine unlearning can be reformulated as a precise knowledge re-mapping problem via model editing. By mapping sensitive inputs to a neutral target state and removing their original representations through a multiplicative parameter update with a closed-form solution that enforces representational orthogonality, unlearning becomes efficient and targeted. This approach extends to a gradient-based variant for handling multiple samples and is shown to outperform baselines while preserving overall model utility.

What carries the argument

The argument rests on the multiplicative parameter update, whose closed-form solution enforces representational orthogonality. The update performs the overwrite: it maps sensitive inputs to neutral target states and removes their original representations.
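As a rough illustration of what a closed-form multiplicative edit can look like, the NumPy sketch below right-multiplies one weight matrix by a projector so that a single key vector `k` no longer produces its original output `m_f`, then writes a neutral target `m_n` along that key. Inputs orthogonal to `k` are left untouched. The names `remap_key`, `W`, `k`, and `m_n` are hypothetical, and the formula is a generic rank-one construction chosen for clarity. It is an assumption, not the update derived in the paper.

```python
import numpy as np

def remap_key(W, k, m_n):
    """Return W' with W' @ k == m_n and W' @ x == W @ x for every x orthogonal to k.

    Illustrative assumption only: a generic projector-based rank-one edit,
    not the closed-form update from the ZeroUnlearn paper.
    """
    kk = k @ k                                   # squared norm of the key
    # Multiplicative step: projector that annihilates the direction of k.
    Q = np.eye(k.shape[0]) - np.outer(k, k) / kk
    W_erased = W @ Q                             # W_erased @ k == 0
    # Overwrite step: write the neutral target along k.
    return W_erased + np.outer(m_n, k) / kk

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))      # toy weight matrix (d_out x d_in)
k = rng.normal(size=6)           # toy key for the sensitive input
m_n = np.zeros(4)                # toy neutral target state
m_f = W @ k                      # original (sensitive) output before the edit

W_new = remap_key(W, k, m_n)

# An unrelated input, made exactly orthogonal to k for the check below.
x_other = rng.normal(size=6)
x_other -= (x_other @ k) / (k @ k) * k
```

In this toy setting the edited matrix sends `k` to `m_n` while `W_new @ x_other` equals `W @ x_other`. Real keys are rarely orthogonal to one another, which is one reason utility preservation has to be measured rather than assumed.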

If this is right

  • Sensitive inputs stop triggering the original harmful or private generations after the update.
  • Performance on general tasks and unrelated knowledge stays largely unchanged.
  • Unlearning works with only a few examples instead of full retraining or heavy fine-tuning.
  • The closed-form solution makes the process computationally efficient compared to iterative optimization methods.
  • The gradient-based extension allows handling multiple samples while maintaining the orthogonality property.
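The last point can be made concrete with a small projected-gradient sketch. It is written in the spirit of null-space constrained editing (the idea behind AlphaEdit, one of the baselines), which is swapped in here as an assumption because this page does not reproduce the paper's gradient-based variant. The function names and the matrices `K_f` (forget keys), `K0` (keys to preserve), and `M_n` (neutral targets) are hypothetical.

```python
import numpy as np

def null_space_projector(K0):
    """Projector onto the orthogonal complement of the columns of K0 (d_in x n0)."""
    # K0 @ pinv(K0) projects onto the column space of K0; subtract it from I.
    return np.eye(K0.shape[0]) - K0 @ np.linalg.pinv(K0)

def projected_gd_unlearn(W, K_f, M_n, K0, steps=200):
    """Minimise 0.5 * ||W K_f - M_n||^2 with updates confined to the null space of K0.

    Illustrative assumption only, not the ZeroUnlearn-GD algorithm.
    """
    P = null_space_projector(K0)
    # Step size from the largest eigenvalue of the projected Gram matrix,
    # which keeps the loss non-increasing.
    lr = 1.0 / np.linalg.eigvalsh(K_f.T @ P @ K_f).max()
    W_new = W.copy()
    for _ in range(steps):
        grad = (W_new @ K_f - M_n) @ K_f.T   # gradient of the squared error
        W_new -= lr * grad @ P               # projected step: leaves W @ K0 unchanged
    return W_new

rng = np.random.default_rng(1)
W = rng.normal(size=(4, 8))       # toy weight matrix
K_f = rng.normal(size=(8, 3))     # three forget keys (columns)
K0 = rng.normal(size=(8, 4))      # four preserved keys (columns)
M_n = np.zeros((4, 3))            # neutral targets for the forget keys

def loss(Wm):
    return 0.5 * np.sum((Wm @ K_f - M_n) ** 2)

loss_before = loss(W)
W_new = projected_gd_unlearn(W, K_f, M_n, K0)
loss_after = loss(W_new)
```

Because every step is multiplied by the projector, outputs for the preserved keys stay fixed while the forget keys move toward their neutral targets.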

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The approach could reduce the need for periodic full retraining when new privacy regulations require removing specific data points.
  • It might combine with other editing techniques to handle sequential or conflicting unlearning requests over time.
  • Similar remapping ideas could apply to domains beyond language models, such as removing biases in vision systems.
  • Deployment in production systems would require verifying that the neutral target state does not introduce new unintended behaviors.

Load-bearing premise

Mapping sensitive inputs to a neutral target state combined with enforcing representational orthogonality via the multiplicative closed-form update will remove the original representations sufficiently to achieve unlearning without degrading unrelated knowledge or overall utility.

What would settle it

A direct test would check whether the updated model still generates the original sensitive content when given related prompts, or whether its accuracy on standard benchmarks for unrelated tasks falls measurably below the baseline model.
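A minimal sketch of such a test is shown below, assuming a model can be treated as a function from prompt to completion. The `Model` interface, the helper names, and the toy lookup-table models are all hypothetical stand-ins. A real check would call an actual LLM and use established benchmarks.

```python
from typing import Callable, Sequence, Tuple

Model = Callable[[str], str]  # hypothetical interface: prompt in, completion out

def leak_rate(model: Model, prompts: Sequence[str], secret: str) -> float:
    """Fraction of prompts whose completion still contains the sensitive string."""
    hits = sum(secret.lower() in model(p).lower() for p in prompts)
    return hits / len(prompts)

def accuracy(model: Model, benchmark: Sequence[Tuple[str, str]]) -> float:
    """Exact-match accuracy on (question, answer) pairs of unrelated knowledge."""
    correct = sum(model(q).strip() == a for q, a in benchmark)
    return correct / len(benchmark)

# Toy stand-ins for a base and an edited model (lookup tables, not real LLMs).
base_table = {
    "Where does Alice live?": "Alice lives in Paris.",
    "Alice's home city is": "Paris",
    "2 + 2 =": "4",
}
# The toy edit only changes the exact prompt it was given.
edited_table = {**base_table, "Where does Alice live?": "I am not sure."}
base_model: Model = base_table.__getitem__
edited_model: Model = edited_table.__getitem__

direct = ["Where does Alice live?"]
paraphrased = ["Alice's home city is"]
benchmark = [("2 + 2 =", "4")]

direct_leak = leak_rate(edited_model, direct, "Paris")
indirect_leak = leak_rate(edited_model, paraphrased, "Paris")
utility_gap = accuracy(base_model, benchmark) - accuracy(edited_model, benchmark)
```

The toy edited model is deliberately built to pass the direct check while still leaking on the paraphrase, which is the failure mode a direct-prompt evaluation alone would miss.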

Figures

Figures reproduced from arXiv: 2605.18879 by Chengyi Yang, Jinsong Su, Yiping Song, Yujie Lin, Zhishang Xiang.

Figure 1
Geometric illustration of ZeroUnlearn. The original sensitive output m_f is first projected onto the null space via the projection matrix P (Step a). Subsequently, the optimization process aligns the projected representation with the target neutral state m_n (Step b) to achieve precise knowledge erasure.
3.2. Autoregressive Large Language Models. Autoregressive LLMs acquire and store knowledge through next-to…
Figure 2
Causal tracing for knowledge localization.
6. Experiments. 6.1. Settings. Base Model and Baselines. We employ three widely adopted models, Llama-3.2-3B-Instruct (Llama-3.2), Llama-3.1-8B-Instruct (Llama-3.1) (Grattafiori et al., 2024) and Qwen-3-4B (Qwen-3) (Yang et al., 2025), as our base models. Since knowledge editing-based approaches typically utilize only the forget set, we adopt GA (Jang et al., 2023)…
Figure 3
PCA visualization of MLP representation shifts at Layer 16 of Llama-3.2 on the MCF dataset.
[Plot residue: bar chart titled "Downstream Task Evaluation"; y-axis: Accuracy; tasks: SST, MMLU, MRPC, COLA, RTE, NLI; methods: Base, GA, FT, ROME, MEMIT, AlphaEdit, ZeroUnlearn]
Figure 4
Evaluation of general capabilities on Llama-3.2.
…edge, significantly outperforming dedicated mass-editing baselines like MEMIT and AlphaEdit, which struggle to eliminate residual information. Crucially, ZeroUnlearn-GD achieves this thorough unlearning without the catastrophic model collapse observed in optimization-based approaches; while GA and FT lead to exploded perplexity and a total loss of specificit…
Figure 5
Figure 5 illustrates the variation in the average indirect effect (AIE) for each token across all layers. We observe that for MLP outputs, the layers where the last subject token exhibits the peak AIE are often concentrated in the model’s early (bottom) layers. However, our experiments reveal that editing these lower layers significantly compromises the model’s general capabilities. In practice, specifically for Ll…
Figure 6
Average Indirect Effect of Attention modules across different architectures.
[Plot residue: panels (a) Llama-3.2-3B-Instruct and (b) Llama-3.1-8B-Instruct, with a further panel truncated; x-axis: single patched layer; rows: first subject token, middle subject tokens, last subject token, first subsequent token, further tokens, last token; color scale: average indirect effect of h_i^(l)]
Figure 7
Layer-wise causal efficacy of hidden states (h_i^(l)).
Figure 8
PCA visualization of MLP representation shifts at Layer 19 of Llama-3.1.
Figure 9
PCA visualization of MLP representation shifts at Layer 16 of Llama-3.2.
Figure 10
PCA visualization of MLP representation shifts at Layer 9 of Qwen3-4B.
Original abstract

Large language models inevitably retain sensitive information, defined as inputs that may induce harmful generations, due to training on massive web corpora, raising concerns for privacy and safety. Existing machine unlearning methods primarily rely on retraining or aggressive fine-tuning, which are either computationally expensive or prone to degrading related knowledge and overall model utility. In this work, we reformulate machine unlearning as a precise knowledge re-mapping problem via model editing. We propose ZeroUnlearn, a few-shot unlearning framework. It overwrites sensitive inputs by mapping them to a neutral target state and removing their original representations. ZeroUnlearn enforces representational orthogonality through a multiplicative parameter update with a closed-form solution, enabling efficient and targeted unlearning. We further extend ZeroUnlearn to a gradient-based variant for multi-sample unlearning. Experiments demonstrate that our approach outperforms existing baselines while preserving general model utility. Our code is available at the github: https://github.com/XMUDeepLIT/ZeroUnlearn.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

1 major / 2 minor

Summary. The manuscript proposes ZeroUnlearn, a few-shot unlearning framework for LLMs that reformulates the problem as knowledge re-mapping. Sensitive inputs are mapped to a neutral target state, and their original representations are removed via a multiplicative parameter update whose closed-form solution enforces representational orthogonality. A gradient-based extension handles multi-sample cases. Experiments are stated to demonstrate outperformance over baselines while preserving general model utility.

Significance. If the empirical claims hold, the approach would provide an efficient, low-cost alternative to retraining or aggressive fine-tuning for targeted unlearning, with direct relevance to privacy and safety applications. The closed-form multiplicative update and public code release are concrete strengths that aid reproducibility and verification.

major comments (1)
  1. [Method description and experimental evaluation] The central unlearning guarantee rests on the claim that mapping a few-shot set of sensitive inputs to a neutral target and enforcing orthogonality via the closed-form update removes the underlying knowledge. Because LLM knowledge is distributed, this local edit on exact inputs may leave facts elicitable through paraphrases, contextual inference, or related queries never seen in the update; the manuscript should include targeted experiments measuring success against such indirect prompts to substantiate the claim.
minor comments (2)
  1. [Abstract] The abstract asserts outperformance and utility preservation but supplies no concrete metrics, baselines, dataset sizes, or controls; these details must appear explicitly in the results section with tables or figures.
  2. [Method] Notation for the neutral target state and the multiplicative update matrix should be defined once with consistent symbols across equations and text.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive summary and for identifying a key area for strengthening our empirical claims. We address the major comment below and will revise the manuscript to incorporate the suggested evaluation.

Point-by-point responses
  1. Referee: The central unlearning guarantee rests on the claim that mapping a few-shot set of sensitive inputs to a neutral target and enforcing orthogonality via the closed-form update removes the underlying knowledge. Because LLM knowledge is distributed, this local edit on exact inputs may leave facts elicitable through paraphrases, contextual inference, or related queries never seen in the update; the manuscript should include targeted experiments measuring success against such indirect prompts to substantiate the claim.

    Authors: We agree that the distributed nature of knowledge in LLMs means local edits on exact inputs may not fully address elicitation via paraphrases or related queries, and that targeted experiments on indirect prompts would strengthen the unlearning guarantee. Our current experiments evaluate direct removal on the provided sensitive inputs, showing effective mapping to neutral states, orthogonality, and outperformance over baselines with preserved utility. To address this point, we will add experiments in the revised manuscript that test the unlearned model on paraphrased versions of the sensitive inputs, contextual inference queries, and related but unseen queries, measuring whether the original knowledge remains inaccessible.

    Revision: yes

Circularity Check

0 steps flagged

No significant circularity in ZeroUnlearn derivation chain

full rationale

The paper reformulates unlearning as a re-mapping problem and derives a closed-form multiplicative parameter update directly from the stated goal of enforcing representational orthogonality after mapping sensitive inputs to a neutral target state. This is an algebraic solution to an explicit objective rather than a self-referential definition, a fitted input renamed as prediction, or a result that reduces to its own inputs by construction. No load-bearing self-citations, uniqueness theorems imported from prior author work, or smuggled ansatzes are indicated; the mechanism is presented as following from the edit equations without circular reduction. The overall claim remains self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

The approach relies on standard model editing assumptions plus the specific choice of neutral target state and orthogonality enforcement; no new invented physical entities.

free parameters (1)
  • neutral target state
    The specific neutral mapping target for sensitive inputs is introduced to overwrite original representations and is not derived from first principles.
axioms (1)
  • domain assumption: A multiplicative parameter update with closed-form solution can enforce representational orthogonality between original and new states.
    This is the core mechanism invoked to achieve targeted removal of original representations.
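For intuition about how a multiplicative update can enforce orthogonality, the standard rank-one orthogonal projector is a useful reference point. The symbols m_f and P follow the Figure 1 caption. The specific form of P below is an assumption, since this page does not reproduce the paper's formula.

```latex
P = I - \frac{m_f m_f^{\top}}{m_f^{\top} m_f},
\qquad
P m_f = m_f - m_f \, \frac{m_f^{\top} m_f}{m_f^{\top} m_f} = 0,
\qquad
P^{2} = P = P^{\top}.

\text{For any vector } v: \quad
\langle P v,\, m_f \rangle = v^{\top} P^{\top} m_f = v^{\top} P m_f = 0 .
```

Under this assumption, every representation multiplied by P has no component along m_f, which is the sense in which a single matrix product can guarantee orthogonality to the original state.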

pith-pipeline@v0.9.0 · 5713 in / 1412 out tokens · 57767 ms · 2026-05-21T08:02:37.667291+00:00 · methodology


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

24 extracted references · 24 canonical work pages · 11 internal anchors

  1. [1]

    GPT-4 Technical Report

    Achiam, J., Adler, S., Agarwal, S., Ahmad, L., Akkaya, I., Aleman, F. L., Almeida, D., Altenschmidt, J., Altman, S., Anadkat, S., et al. GPT-4 technical report. arXiv preprint arXiv:2303.08774.

  2. [2]

    Soft prompting for unlearning in large language models

    Bhaila, K., Van, M.-H., and Wu, X. Soft prompting for unlearning in large language models. In Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers), pp. 4046–4056.

  3. [3]

    Who’s Harry Potter? Approximate unlearning in LLMs. arXiv preprint arXiv:2310.02238

    URL https://arxiv.org/abs/2310.02238.
    Fang, J., Jiang, H., Wang, K., Ma, Y., Jie, S., Wang, X., He, X., and Chua, T.-S. AlphaEdit: Null-space constrained knowledge editing for language models. arXiv preprint arXiv:2410.02355.

  4. [4]

    The Llama 3 Herd of Models

    Grattafiori, A., Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Vaughan, A., et al. The Llama 3 herd of models. arXiv preprint arXiv:2407.21783.

  5. [5]

    Certified data removal from machine learning models

    Guo, C., Goldstein, T., Hannun, A., and Van Der Maaten, L. Certified data removal from machine learning models. arXiv preprint arXiv:1911.03030.

  6. [7]

    Measuring Massive Multitask Language Understanding

    URL https://arxiv.org/abs/2009.03300.
    Huang, Z., Shen, Y., Zhang, X., Zhou, J., Rong, W., and Xiong, Z. Transformer-Patcher: One mistake worth one neuron. arXiv preprint arXiv:2301.09785.

  7. [8]

    Soul: Unlocking the power of second-order optimization for llm unlearning

    Jia, J., Zhang, Y., Zhang, Y., Liu, J., Runwal, B., Diffenderfer, J., Kailkhura, B., and Liu, S. SOUL: Unlocking the power of second-order optimization for LLM unlearning. arXiv preprint arXiv:2404.18239.

  8. [9]

    Object Hallucination-Free Reinforcement Unlearning for Vision-Language Models

    URL https://arxiv.org/abs/2605.08031.
    Levy, O., Seo, M., Choi, E., and Zettlemoyer, L. Zero-shot relation extraction via reading comprehension.

  9. [10]

    Zero-Shot Relation Extraction via Reading Comprehension

    URL https://arxiv.org/abs/1706.04115.
    Lin, Y., Zhao, C., Shao, M., Meng, B., Zhao, X., and Chen, H. Towards counterfactual fairness-aware domain generalization in changing environments. arXiv preprint arXiv:2309.13005.

  10. [11]

    FADE: Towards fairness-aware generation for domain generalization via classifier-guided score-based diffusion models

    Lin, Y., Li, D., Shao, M., Wan, G., and Zhao, C. FADE: Towards fairness-aware generation for domain generalization via classifier-guided score-based diffusion models. arXiv preprint arXiv:2406.09495.

  11. [12]

    HCRE: LLM-based Hierarchical Classification for Cross-Document Relation Extraction with a Prediction-then-Verification Strategy

    URL https://openreview.net/forum?id=mUTN9VIaSy.
    Ma, G., Zhang, L., Tu, H., Fu, H., Li, H., Lin, Y., Wang, L., Luo, W., and Su, J. HCRE: LLM-based hierarchical classification for cross-document relation extraction with a prediction-then-verification strategy. arXiv preprint arXiv:2604.07937.

  12. [13]

    TOFU: A Task of Fictitious Unlearning for LLMs

    Maini, P., Feng, Z., Schwarzschild, A., Lipton, Z. C., and Kolter, J. Z. Tofu: A task of fictitious unlearning for llms. arXiv preprint arXiv:2401.06121,

  13. [14]

    Mass-Editing Memory in a Transformer

    Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and editing factual associations in GPT. Advances in Neural Information Processing Systems, 35:17359–17372, 2022a.
    Meng, K., Sharma, A. S., Andonian, A., Belinkov, Y., and Bau, D. Mass-editing memory in a transformer. arXiv preprint arXiv:2210.07229, 2022b.
    Mitchell, E., Lin, C., Bosselut, A., Fin...

  14. [15]

    Scalable Extraction of Training Data from (Production) Language Models

    Nasr, M., Carlini, N., Hayase, J., Jagielski, M., Cooper, A. F., Ippolito, D., Choquette-Choo, C. A., Wallace, E., Tramèr, F., and Lee, K. Scalable extraction of training data from (production) language models. arXiv preprint arXiv:2311.17035.

  15. [16]

    Supervised algorithmic fairness in distribution shifts: A survey

    Shao, M., Li, D., Zhao, C., Wu, X., Lin, Y., and Tian, Q. Supervised algorithmic fairness in distribution shifts: A survey. arXiv preprint arXiv:2402.01327.

  16. [17]

    Recursive deep models for semantic compositionality over a sentiment treebank

    Socher, R., Perelygin, A., Wu, J., Chuang, J., Manning, C. D., Ng, A., and Potts, C. Recursive deep models for semantic compositionality over a sentiment treebank. In Yarowsky, D., Baldwin, T., Korhonen, A., Livescu, K., and Bethard, S. (eds.), Proceedings of the 2013 Conference on Empirical Methods in Natural Language Processing, pp. 1631–1642, Seattl...

  17. [18]

    GLUE: A Multi-Task Benchmark and Analysis Platform for Natural Language Understanding

    URL https://arxiv.org/abs/1804.07461.
    Wang, X., Liu, X., Wang, L., Wu, S., Su, J., and Wu, H. A simple yet effective self-debiasing framework for transformer models. Artificial Intelligence, 339:104258.

  18. [19]

    doi: https://doi.org/10.1016/j.artint.2024.104258

    ISSN 0004-3702. doi: https://doi.org/10.1016/j.artint.2024.104258. URL https://www.sciencedirect.com/science/article/pii/S0004370224001942.
    Warstadt, A., Singh, A., and Bowman, S. R. Neural network acceptability judgments. Transactions of the Association for Computational Linguistics, 7:625–641.

  19. [20]

    Unveiling the implicit toxicity in large language models

    Wen, J., Ke, P., Sun, H., Zhang, Z., Li, C., Bai, J., and Huang, M. Unveiling the implicit toxicity in large language models. arXiv preprint arXiv:2311.17391.

  20. [21]

    A broad-coverage challenge corpus for sentence understanding through inference

    Williams, A., Nangia, N., and Bowman, S. A broad-coverage challenge corpus for sentence understanding through inference. In Proceedings of the 2018 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies, Volume 1 (Long Papers), pp. 1112–1122.

  21. [22]

    Qwen3 Technical Report

    Yang, A., Li, A., Yang, B., Zhang, B., Hui, B., Zheng, B., Yu, B., Gao, C., Huang, C., Lv, C., et al. Qwen3 technical report. arXiv preprint arXiv:2505.09388.

  22. [23]

    Machine unlearning of pre-trained large language models

    Yao, J., Chien, E., Du, M., Niu, X., Wang, T., Cheng, Z., and Yue, X. Machine unlearning of pre-trained large language models. arXiv preprint arXiv:2402.15159, 2024a.
    Yao, Y., Xu, X., and Liu, Y. Large language model unlearning. Advances in Neural Information Processing Systems, 37:105425–105475, 2024b.
    Zhong, Z., Wu, Z., Manning, C. D., Potts, C., and ...

  23. [24]

    Modifying memories in transformer models

    URL https://arxiv.org/abs/2305.14795.
    Zhu, C., Rawat, A. S., Zaheer, M., Bhojanapalli, S., Li, D., Yu, F., and Kumar, S. Modifying memories in transformer models. arXiv preprint arXiv:2012.00363.

  24. [25]

    over-correction

    A. Notation

    Table 3. Summary of symbols used throughout the paper. Vectors and matrices are in bold.

    Symbol | Meaning
    D_f = {(x_i, y_i)}_{i=1}^{n} | Forget set (samples whose influence should be removed).
    f_θ, θ ∈ Θ | Pre-trained language model parameterized by θ.
    U(·) | Unlearning operator; θ′ = U(θ, D_f).
    θ′,...