
arxiv: 2605.16776 · v1 · pith:QNLXS4CHnew · submitted 2026-05-16 · 💻 cs.LG · cs.AI

Distinguishable Deletion: Unifying Knowledge Erasure and Refusal for Large Language Model Unlearning

Pith reviewed 2026-05-19 21:45 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords LLM unlearning · knowledge erasure · refusal mechanisms · energy-based models · latent representations · safe AI · Distinguishable Deletion

The pith

Distinguishable Deletion unifies knowledge erasure and refusal by restricting response distributions in latent space for LLMs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces Distinguishable Deletion (D²) to address two failure modes of existing unlearning approaches for large language models. Knowledge-deletion methods tend toward biased deletion: they suppress specific token sequences instead of removing the underlying knowledge. Refusal methods leave the knowledge intact, so it can re-emerge. D² restricts the response distribution in the latent representation and distinguishes unlearned from retained knowledge, which is meant to enable both erasure and safe refusal. An energy index measures knowledge presence and this separation; it supports alignment during training and refusal at inference, and the reported experiments show Energy-based Unlearning Alignment (EUA) outperforming previous methods.

Core claim

Distinguishable Deletion (D²) restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. This is implemented using an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses support that energy is accurate and efficient, allowing Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference.
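The two uses of energy named above (a boundary enforced during training, a refusal gate at inference) can be sketched in a few lines. This is a minimal illustration, not the paper's objective: the squared-hinge form is borrowed from energy-bounded objectives used in energy-based out-of-distribution detection, the margins `m_retain` and `m_forget`, the threshold, and the refusal string are hypothetical, and the direction (high energy means the knowledge is absent) is an assumption.

```python
def boundary_penalty(energy, is_forget, m_retain=-8.0, m_forget=-2.0):
    """Squared hinge on a scalar energy (assumed form, hypothetical margins).

    Retained samples are penalized when their energy rises above m_retain;
    unlearned samples are penalized when their energy stays below m_forget.
    """
    if is_forget:
        return max(0.0, m_forget - energy) ** 2
    return max(0.0, energy - m_retain) ** 2


def answer_or_refuse(energy, threshold, generate, refusal="I can't help with that."):
    """Inference-time gate: refuse when the energy exceeds the threshold.

    Assumes high energy signals unlearned content; `generate` is any
    zero-argument callable that produces the normal model response.
    """
    return refusal if energy > threshold else generate()
```

With these margins, a retained sample at energy -6.0 incurs a penalty of 4.0 and an unlearned sample at -1.0 incurs none; a threshold placed between the two margins would then route the second sample to the refusal branch.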

What carries the argument

The energy index, which quantifies the presence of knowledge and the separation between unlearned and retained content in the latent representations.
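This page does not reproduce the paper's definition of the energy index. As a hedged sketch, the block below assumes the standard free-energy form from energy-based learning, E = -T · log Σᵢ exp(zᵢ / T) over a logit vector z. The `temperature` parameter, and the question of how energies are aggregated across tokens or layers, are assumptions left open here.

```python
import math


def energy(logits, temperature=1.0):
    """Free energy of a logit vector, E = -T * log(sum_i exp(z_i / T)).

    Assumed definition (standard in energy-based learning), computed with
    the max-subtraction trick so large logits do not overflow.
    """
    scaled = [z / temperature for z in logits]
    m = max(scaled)
    return -temperature * (m + math.log(sum(math.exp(s - m) for s in scaled)))


confident = energy([10.0, 0.0, 0.0])  # one dominant logit
flat = energy([0.0, 0.0, 0.0])        # no preferred output
```

Under this definition a confident prediction has lower energy than a flat one (`confident < flat`), which is the sense in which low energy could indicate that knowledge is present.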

Load-bearing premise

The energy index accurately quantifies the presence of knowledge and the separation between unlearned and retained content in latent representations.

What would settle it

An experiment where the energy index does not create a clear boundary, resulting in either incomplete erasure of sensitive knowledge or unintended suppression of retained knowledge.
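One way such an experiment could be scored, given per-sample energies for a forget set and a retain set, is sketched below. The function names are hypothetical, and the sketch assumes unlearned content should sit at higher energy than retained content. `auroc` measures how cleanly the two sets separate; `boundary_errors` reports the two failure modes named above for a chosen threshold.

```python
def auroc(forget_energies, retain_energies):
    """Probability that a random forget sample has higher energy than a
    random retain sample (ties count as 0.5). 1.0 means a clean boundary."""
    wins = 0.0
    for f in forget_energies:
        for r in retain_energies:
            if f > r:
                wins += 1.0
            elif f == r:
                wins += 0.5
    return wins / (len(forget_energies) * len(retain_energies))


def boundary_errors(forget_energies, retain_energies, threshold):
    """Return (missed, suppressed) rates for a given energy threshold.

    missed: forget samples at or below the threshold (incomplete erasure).
    suppressed: retain samples above the threshold (unintended suppression).
    """
    missed = sum(e <= threshold for e in forget_energies) / len(forget_energies)
    suppressed = sum(e > threshold for e in retain_energies) / len(retain_energies)
    return missed, suppressed
```

An AUROC near 0.5, or error rates that cannot both be driven toward zero by any threshold, would be the kind of outcome that counts against the premise.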

Figures

Figures reproduced from arXiv: 2605.16776 by Bo Han, Junchi Yu, Philip Torr, Puning Yang, Qizhou Wang, Xiuying Chen.

Figure 1: Motivation and overview of our work. Left: Existing unlearning methods fall short in overall performance and general reliability: KD-based unlearning often produces unstable outputs, while DR-based unlearning is highly vulnerable to adversarial attacks. Right: These limitations in practicality and reliability motivate us to explore a new unlearning paradigm, Distinguishable Deletion (D²), equipped with a…
Figure 2: Energy dynamics reveal the instability of KD-based unlearning and motivate EUA. (a) GradDiff reduces targeted logits while unconstrainedly increasing other-label logits. (b) This corresponds to decreasing and increasing negative energy, respectively; the overall energy first decreases and then increases, indicating a transition from under- to over-unlearning. (c)(d) Improved GradDiff-based methods achieve …
Figure 3: The robustness evaluations of EUA on TOFU with LLaMA-3.2-3B. Detailed values are shown in …
Figure 4: Energy distribution for existing KD-based methods. Results are obtained on TOFU-5% with LLaMA-3.2-3B.
Training trajectory for energy. Furthermore, we present additional training trajectories and analyze the training dynamics of KD-based methods through Maximum Softmax Probability (MSP), which provides a more interpretable perspective. As shown in …
Figure 5: Trajectories of energy and Maximum Softmax Probability (MSP). Results are obtained on TOFU-5% using LLaMA-3.2-3B.
Efficiency of the energy index. The proposed energy index provides an efficient mechanism for estimating knowledge presence. While several recent works pursue similar goals, they rely on sampling-based generation methods, which require multiple samples (Li et al., 2026) or repeated output sampl…
Figure 6: Trajectory results during training on TOFU benchmark.
[Plot residue: panels (a) GradDiff, (b) WGA, (c) SatImp, …; x-axis Training Steps (50 to 350); y-axis Performance (0 to 60); series VerbMem, KnowMem, UtilPres.]
Figure 7: Trajectory results during training on MUSE benchmark.
Figure 8: Energy changes under different attack strategies. We observe that the relearning attack is substantially more aggressive than prompt-based attacks. To further evaluate robustness, we conduct additional experiments on the MUSE benchmark and compare EUA with prior methods. Results shown in …
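The captions for Figures 2 and 5 relate logits, negative energy, and Maximum Softmax Probability (MSP). As a rough illustration, the sketch below computes both quantities from a single logit vector. It assumes negative energy is the log-sum-exp of the logits, which is the usual energy-based-learning convention and is not confirmed by this page; the toy logit values are invented.

```python
import math


def neg_energy(logits):
    """Log-sum-exp of the logits (assumed definition of negative energy)."""
    m = max(logits)
    return m + math.log(sum(math.exp(z - m) for z in logits))


def msp(logits):
    """Maximum softmax probability of a logit vector."""
    return math.exp(max(logits) - neg_energy(logits))


before = [8.0, 1.0, 1.0]             # hypothetical logits, target token at index 0
after = [2.0, 4.0, 4.0]              # target logit lowered, other logits raised
shifted = [z + 5.0 for z in before]  # every logit raised by the same constant
```

Both scores come from the logits of one forward pass, with no sampling. They differ under a uniform shift: `msp(shifted)` equals `msp(before)`, while `neg_energy(shifted)` is larger than `neg_energy(before)` by exactly the shift, so energy retains logit-scale information that MSP discards.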

Mitigating sensitive and harmful outputs is fundamental to ensuring safe deployment of LLMs. Existing approaches typically follow two paradigms: Knowledge Deletion (KD), which erases undesirable information during training, and Distinguishable Refusal (DR), which steers models away from using sensitive knowledge during inference. Despite rapid progress, KD-based unlearning struggles with biased deletion due to suppressing specific token sequences as a substitute for complete knowledge removal, whereas DR-based unlearning risks the re-emergence of harmful knowledge because the underlying knowledge remains intact. To address these issues, we propose Distinguishable Deletion ($\mathrm{D^2}$), a paradigm that restricts the response distribution in the latent representation rather than specific tokens to erase undesirable knowledge, while distinguishing it from retained knowledge, enabling a refusal mechanism to handle unlearned inputs safely and coherently. To implement $\mathrm{D^2}$, we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference. Extensive experiments demonstrate that EUA significantly outperforms previous methods, indicating the superiority of $\mathrm{D^2}$. Our code is available at https://github.com/Puning97/EUA-for-LLM-Unlearning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 1 minor

Summary. The paper proposes Distinguishable Deletion (D²) to unify knowledge deletion (KD) and distinguishable refusal (DR) for LLM unlearning. It introduces an energy index that quantifies knowledge presence and separation in latent representations, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary restrictions during training and an energy-based refusal mechanism at inference. Mathematical and empirical analyses are claimed to show that energy is accurate and efficient, with extensive experiments demonstrating that EUA outperforms prior methods.

Significance. If the energy index reliably separates unlearned from retained knowledge, this could offer a more robust alternative to existing KD and DR approaches by addressing biased deletion and knowledge re-emergence. The code release supports reproducibility, which is a strength for validating the empirical claims.

major comments (2)
  1. [Abstract and §3] Abstract and §3: The central claim that the energy index enables complete erasure without collateral damage to retained knowledge rests on the untested assumption that unlearned content occupies a distinct energy region with no overlap. No derivation or analysis demonstrates preservation of this separation when a single fact is statistically entangled with retained facts via shared entities or reasoning chains in latent space.
  2. [Empirical Evaluation] Empirical section: Tables and experiments test only isolated deletion tasks. Without ablations on entangled knowledge scenarios, the reported superiority of EUA over baselines does not yet establish generalizability of the D² paradigm.
minor comments (1)
  1. [§3] The definition of the energy index would benefit from an explicit equation number referenced in the text when first introduced, to improve traceability of the mathematical analysis.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the detailed and constructive comments on our paper. We address each of the major comments below and outline the revisions we will make to strengthen the manuscript.

  1. Referee: [Abstract and §3] Abstract and §3: The central claim that the energy index enables complete erasure without collateral damage to retained knowledge rests on the untested assumption that unlearned content occupies a distinct energy region with no overlap. No derivation or analysis demonstrates preservation of this separation when a single fact is statistically entangled with retained facts via shared entities or reasoning chains in latent space.

    Authors: We thank the referee for highlighting this important point. Our analysis in Section 3 derives the energy index based on the separation in latent representations and shows through mathematical formulation that it can quantify and enforce boundaries between unlearned and retained knowledge. However, we acknowledge that the specific case of statistical entanglement through shared entities or reasoning chains is not explicitly analyzed for preservation of separation. In the revised manuscript, we will add a discussion and partial derivation addressing how the energy-based alignment can maintain separation even in entangled scenarios by leveraging the global energy distribution rather than local token dependencies.
    revision: yes

  2. Referee: [Empirical Evaluation] Empirical section: Tables and experiments test only isolated deletion tasks. Without ablations on entangled knowledge scenarios, the reported superiority of EUA over baselines does not yet establish generalizability of the D² paradigm.

    Authors: We appreciate this feedback on the empirical evaluation. The current experiments follow the standard benchmarks in the unlearning literature, which primarily use isolated deletion tasks to measure effectiveness. We agree that testing on entangled knowledge scenarios is crucial for demonstrating the generalizability of our D² paradigm. Accordingly, we will include additional ablation studies in the revised version, where we construct entangled knowledge sets (e.g., deleting a specific fact while retaining related facts sharing entities) and evaluate the performance of EUA compared to baselines in these settings.
    revision: yes
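The entangled-knowledge ablation proposed in the second response could start from a split like the one sketched here. Everything in it is hypothetical: the fact format, the function name, and the placeholder facts are illustrative and do not come from the paper or its benchmarks. A retained fact counts as entangled when it shares at least one entity with any fact in the forget set.

```python
def split_retain_by_entanglement(forget_facts, retain_facts):
    """Partition retained facts by whether they share an entity with the forget set.

    Each fact is a dict with a "text" string and an "entities" list
    (a hypothetical format chosen for this sketch).
    """
    forget_entities = set()
    for fact in forget_facts:
        forget_entities.update(fact["entities"])

    entangled, isolated = [], []
    for fact in retain_facts:
        if forget_entities.intersection(fact["entities"]):
            entangled.append(fact)
        else:
            isolated.append(fact)
    return entangled, isolated


# Placeholder facts about fictitious authors, for illustration only.
forget = [{"text": "Author A was born in City X.", "entities": ["Author A", "City X"]}]
retain = [
    {"text": "Author A wrote Novel N.", "entities": ["Author A", "Novel N"]},
    {"text": "Author B lives in City Y.", "entities": ["Author B", "City Y"]},
]
entangled, isolated = split_retain_by_entanglement(forget, retain)
```

Reporting retention metrics separately on `entangled` and `isolated` would show whether any collateral damage concentrates on facts that share entities with the deleted ones, which is the referee's concern.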

Circularity Check

1 step flagged

Self-introduced energy index creates moderate circularity in unlearning claims

specific steps
  1. self-definitional [Abstract]
    "To implement D², we introduce an energy index that quantifies the presence of knowledge and the separation between unlearned and retained content. Mathematical and empirical analyses show that energy is both accurate and efficient, enabling Energy-based Unlearning Alignment (EUA) to enforce energy-boundary unlearning during training and apply an energy-based refusal mechanism at inference."

    The energy index is introduced by definition to quantify exactly the presence and separation needed for D²; the subsequent claim that analyses show the index is 'accurate' for enforcing unlearning therefore reduces to verifying properties built into the definition itself rather than an independent test of whether such a scalar proxy exists in entangled representations.

full rationale

The paper introduces an energy index specifically to implement the D² paradigm by quantifying knowledge presence and separation in latent space, then uses mathematical and empirical analyses to validate that this index enables accurate boundary enforcement and refusal. This creates a moderate circularity burden because the claimed accuracy and efficiency of the index are evaluated against the separation properties it was defined to produce, rather than against fully independent external benchmarks for entangled knowledge. However, the central derivation still contains independent empirical experiments on deletion tasks and comparisons to prior methods, so the overall result is not fully forced by construction. No self-citation chains or uniqueness theorems are load-bearing here.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the effectiveness of the newly introduced energy index for quantifying and separating knowledge in latent space; this is a domain assumption without independent external validation shown in the abstract.

axioms (1)
  • domain assumption: Restricting response distribution in latent representation erases undesirable knowledge while preserving distinction from retained knowledge.
    Core premise of D² as stated in the abstract.
invented entities (1)
  • Energy index (no independent evidence)
    purpose: Quantifies presence of knowledge and separation between unlearned and retained content.
    Introduced to implement D² and EUA; no independent evidence outside the paper's analyses.

pith-pipeline@v0.9.0 · 5792 in / 1321 out tokens · 35579 ms · 2026-05-19T21:45:07.220707+00:00 · methodology


