
arxiv: 2606.27683 · v1 · pith:TCOCTDFJnew · submitted 2026-06-26 · 💻 cs.LG · cs.AI

CBD: API-Only LLM Black-Box Unlearning through Controlled Behavioral Divergence

Pith reviewed 2026-06-29 04:44 UTC · model grok-4.3

classification 💻 cs.LG cs.AI
keywords: machine unlearning · black-box unlearning · large language models · API-only access · behavioral divergence · Fisher matrix · generalized eigenvalue problem · ToFU benchmark

The pith

CBD enables API-only black-box LLM unlearning by routing prompts with behavioral divergence scores from a Fisher-matrix discriminative basis.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper argues that machine unlearning for large language models can succeed under API-only black-box constraints even when target and retained data share similar prompt structures or semantics. It introduces Controlled Behavioral Divergence (CBD), which deploys auxiliary models to generate measurable divergence, converts that divergence into relevance scores used to route unlearning-related prompts away from the target model, and builds a statistical basis from empirical Fisher matrices to focus the signal on target-specific content. A sympathetic reader would care because many practical LLM deployments sit behind restricted APIs where retraining or internal access is impossible, yet sensitive or harmful data must still be removed. The paper reports a better unlearning-utility trade-off than eleven prior white-box and gray-box methods.

Core claim

CBD uses two auxiliary models to produce controlled behavioral divergence between retained inputs and unlearning targets, derives an unlearning relevance score from that divergence, and routes related prompts away from the target LLM. When retained and target data are highly similar, it constructs a discriminative basis by estimating empirical Fisher matrices from API outputs and solving a regularized generalized eigenvalue problem to direct the signal toward target-specific information rather than shared structures. This yields performance approaching the retrained reference on the forget set while achieving 74.90 utility on ToFU forget10 (15 percent above the second-best baseline) and 25.6
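The routing step described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the paper's implementation: the two auxiliary models are stand-in functions that return a scalar behavior score per prompt, the relevance score is taken to be the absolute difference of those scores, and `DivergenceRouter`, `FORGET_TOPIC`, `REFUSAL`, and the threshold value are all hypothetical names and choices.

```python
from dataclasses import dataclass
from typing import Callable

Scorer = Callable[[str], float]      # hypothetical: prompt -> scalar behavior score
Responder = Callable[[str], str]     # hypothetical: prompt -> text answer

REFUSAL = "I can't help with that."
FORGET_TOPIC = "jane example"        # made-up forget-set entity for the demo


@dataclass
class DivergenceRouter:
    aux_a: Scorer          # first auxiliary model
    aux_b: Scorer          # second auxiliary model, trained to diverge on targets
    target_llm: Responder  # the API-only model that is never modified
    fallback: Responder    # what answers prompts routed away from the target
    threshold: float       # relevance cut-off (assumed to be tuned on held-out data)

    def relevance(self, prompt: str) -> float:
        # Assumption: relevance is the gap between the two auxiliary scores.
        return abs(self.aux_a(prompt) - self.aux_b(prompt))

    def answer(self, prompt: str) -> str:
        # High divergence means the prompt looks unlearning-related, so it is
        # routed away from the target LLM; everything else passes through.
        if self.relevance(prompt) >= self.threshold:
            return self.fallback(prompt)
        return self.target_llm(prompt)


# Toy stand-ins so the sketch runs without any model or network access.
def aux_a(prompt: str) -> float:
    return 0.2

def aux_b(prompt: str) -> float:
    return 3.0 if FORGET_TOPIC in prompt.lower() else 0.25

router = DivergenceRouter(
    aux_a=aux_a,
    aux_b=aux_b,
    target_llm=lambda p: f"target-llm: {p}",
    fallback=lambda p: REFUSAL,
    threshold=1.0,
)

print(router.answer("Where was Jane Example born?"))
print(router.answer("What is the capital of France?"))
```

The design point the sketch tries to capture is that the target model is only ever called or bypassed; nothing about its parameters or logits is needed.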

What carries the argument

The discriminative basis obtained by estimating empirical Fisher matrices from API outputs and solving a regularized generalized eigenvalue problem; it isolates target-specific information within the behavioral divergence signal.
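The summary does not reproduce the paper's equations, so the following is only one standard way to write a construction of this kind, offered as an assumption. Here $g(x)$ is a per-example gradient-statistics vector (where it comes from is not specified in this page), $D_f$ and $D_r$ are the forget and retain sets, $\epsilon > 0$ is the regularization parameter, and $k$ is the number of basis directions kept.

```latex
\hat{F}_f = \frac{1}{|D_f|} \sum_{x \in D_f} g(x)\, g(x)^\top,
\qquad
\hat{F}_r = \frac{1}{|D_r|} \sum_{x \in D_r} g(x)\, g(x)^\top

\hat{F}_f\, v_i = \lambda_i \left( \hat{F}_r + \epsilon I \right) v_i,
\qquad
\lambda_1 \ge \lambda_2 \ge \dots \ge \lambda_d \ge 0

\lambda_i = \frac{v_i^\top \hat{F}_f\, v_i}{v_i^\top \left( \hat{F}_r + \epsilon I \right) v_i},
\qquad
B = [\, v_1, \dots, v_k \,]
```

Read this way, each $\lambda_i$ is a ratio of forget-side to retain-side curvature along $v_i$, so directions that matter to both sets are down-weighted and directions that matter mainly to the forget set rise to the top of $B$.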

If this is right

  • Approaches retrained reference performance on the forget set of ToFU forget10.
  • Raises model utility to 74.90 on ToFU forget10, exceeding the second-best baseline by about 15 percent.
  • Lowers hazardous knowledge accuracy to 25.68 on WMDP, near random guessing levels.
  • Preserves MMLU accuracy at 52.67 during unlearning on WMDP.
  • Shows little performance variation across settings compared with prior baselines.
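A short arithmetic check helps put the headline numbers above in context. It is a sketch under assumptions: the summary does not say whether "about 15 percent" is relative or in absolute points, so both readings are computed; chance accuracy is taken as 25% because WMDP questions are four-option multiple choice; and the unmodified-model MMLU score of 59.01 is read from the WMDP table fragment in the Figures section.

```python
utility_cbd = 74.90   # ToFU forget10 model utility reported for CBD
gain = 0.15           # "about 15 percent" above the second-best baseline

# Reading 1: relative gain, CBD = 1.15 x second-best.
second_best_if_relative = utility_cbd / (1 + gain)
# Reading 2: absolute gain, CBD = second-best + 15 points.
second_best_if_absolute = utility_cbd - 100 * gain

random_guess = 100 / 4                    # four answer options per question
wmdp_gap_to_chance = 25.68 - random_guess  # how far CBD sits above chance

mmlu_target = 59.01                       # unmodified target LLM (table fragment)
mmlu_drop = mmlu_target - 52.67           # utility cost on MMLU, in points

print(f"second-best utility if relative: {second_best_if_relative:.2f}")
print(f"second-best utility if absolute: {second_best_if_absolute:.2f}")
print(f"WMDP accuracy above chance:      {wmdp_gap_to_chance:.2f}")
print(f"MMLU drop vs. target LLM:        {mmlu_drop:.2f}")
```

Under either reading the implied second-best utility is well below 74.90, and the MMLU figure shows that the WMDP result is not free: it comes with a drop of several points relative to the unmodified model.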

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The routing mechanism could allow removal of specific data influences from commercial LLM APIs without any provider-side cooperation or model access.
  • The same Fisher-matrix basis construction might transfer to black-box unlearning tasks in non-language domains such as vision models.
  • Further tests on data distributions with greater semantic overlap than ToFU or WMDP would clarify the practical boundary of the discrimination step.
  • Edge-device scenarios that collect user data for later unlearning could adopt the auxiliary-model divergence step without needing local model copies.

Load-bearing premise

That estimating empirical Fisher matrices from API outputs and solving the regularized generalized eigenvalue problem produces a basis that isolates target-specific information rather than shared prompt structures when retained and unlearning data are highly similar.
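The premise can be made concrete with a toy example. The sketch below is an illustration under assumptions, not the paper's procedure: the gradients are made-up 2-D vectors in which the first coordinate is a "shared prompt structure" direction and the second is a "target-specific" direction, and plain power iteration stands in for whatever eigen-solver the authors actually use. All function names are hypothetical.

```python
import math


def outer_mean(grads):
    """Empirical Fisher estimate: the average of outer products g g^T."""
    d = len(grads[0])
    fisher = [[0.0] * d for _ in range(d)]
    for g in grads:
        for i in range(d):
            for j in range(d):
                fisher[i][j] += g[i] * g[j] / len(grads)
    return fisher


def solve(matrix, rhs):
    """Solve matrix @ x = rhs by Gaussian elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(col + 1, n):
            factor = aug[r][col] / aug[col][col]
            for c in range(col, n + 1):
                aug[r][c] -= factor * aug[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        tail = sum(aug[i][j] * x[j] for j in range(i + 1, n))
        x[i] = (aug[i][n] - tail) / aug[i][i]
    return x


def matvec(m, v):
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def leading_direction(fisher_forget, fisher_retain, eps, iters=200):
    """Leading solution of F_f v = lambda (F_r + eps I) v via power iteration."""
    d = len(fisher_forget)
    reg = [[fisher_retain[i][j] + (eps if i == j else 0.0) for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d
    for _ in range(iters):
        w = solve(reg, matvec(fisher_forget, v))   # w = (F_r + eps I)^-1 F_f v
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v


# Forget gradients have energy in both directions; retain gradients only in
# the shared one.
forget_grads = [(1.0, 0.7), (1.0, -0.7), (-1.0, 0.7), (-1.0, -0.7)]
retain_grads = [(1.0, 0.0), (-1.0, 0.0)]
fisher_f = outer_mean(forget_grads)
fisher_r = outer_mean(retain_grads)

v_generalized = leading_direction(fisher_f, fisher_r, eps=0.1)
# Baseline: ignore the retain side entirely (zero matrix, identity regularizer),
# which reduces to the top eigenvector of the forget-side Fisher matrix alone.
zero = [[0.0, 0.0], [0.0, 0.0]]
v_plain = leading_direction(fisher_f, zero, eps=1.0)

print("generalized:", [round(x, 3) for x in v_generalized])
print("plain:      ", [round(x, 3) for x in v_plain])
```

In this toy case the forget-only eigenvector points along the shared direction because that is where most forget-side energy lies, while the generalized problem picks the target-specific direction. Whether real gradient statistics separate this cleanly under high semantic overlap is exactly what the premise leaves open.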

What would settle it

On a dataset with highly overlapping semantics between retained and target data, the method fails to bring forget-set metrics close to the retrained reference or drops retained utility below the second-best baseline.

Figures

Figures reproduced from arXiv: 2606.27683 by Dong In Kim, Yijing Lin, Zhipeng Gao, Zhiqiang Xie.

Figure 1: Architecture comparison of machine unlearning methods for LLMs.
Figure 2: Framework of Controlled Behavioral Divergence (CBD) for black-box unlearning.
Figure 3: Coupled forget-side and retain-side loss trajectories during forget-side …
Figure 4: Hyperparameter sensitivity of CBD on ToFU
Figure 5: Cross-method step stability on ToFU forget10 and WMDP. The heatmaps illustrate the stability of unlearning utility across training steps for different methods.
Figure 6: Receiver operating characteristic curves comparing CBD (with DFB) …

Table VI (truncated): Experimental results on WMDP. Within each method group, the best value in each column is boldfaced.

  Group      Method      Overall↓  Bio↓   Cyber↓  Chem↓  MMLU↑
  Standard   Target LLM  50.95     65.28  42.17   49.02  59.01
  White-box  GA          30.97     37.86  27.63   25.74  33.44
  White-box  GA+GD       3…
Original abstract

Edge devices increasingly invoke large language models (LLMs) through API services for context-aware edge intelligence, while edge-generated data may be collected to improve LLMs and may introduce sensitive, copyrighted, harmful, or outdated information into model behavior. Machine unlearning offers a practical way to remove the influence of undesired data without retraining LLMs. However, existing methods still face two gaps. The first is API-only black-box access, where target model parameters and internal logits are unavailable. The second is how to preserve retained utility when unlearning target data and retained data share highly similar prompt structures or semantic patterns. To address these challenges, we propose Controlled Behavioral Divergence (CBD), an API-only black-box unlearning framework. CBD uses two auxiliary models to create controlled behavioral divergence between retained inputs and unlearning target inputs, converts this divergence into an unlearning relevance score, and routes unlearning-related prompts away from the target LLM. To improve discrimination accuracy under high similarity between target and retained data, CBD constructs a gradient-statistics-based discriminative basis by estimating empirical Fisher matrices and solving a regularized generalized eigenvalue problem, guiding the unlearning signal toward target-specific information rather than shared prompt structures. Compared with eleven white-box and gray-box unlearning baselines, CBD achieves a better unlearning-utility trade-off and its performance varies little across settings. On ToFU forget10, CBD approaches the retrained reference on the forget set while raising model utility to 74.90, about 15% above the second-best baseline. On WMDP, it lowers hazardous knowledge accuracy to 25.68, near random guessing, while preserving MMLU accuracy of 52.67. Code is at https://github.com/DGL-codes/CBD.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The manuscript presents CBD, an API-only black-box unlearning framework for LLMs. It uses two auxiliary models to induce controlled behavioral divergence between retained and target inputs, converts the divergence into an unlearning relevance score, and routes prompts away from the target model. To handle high similarity between retained and unlearning data, it constructs a discriminative basis by estimating empirical Fisher matrices from API outputs and solving a regularized generalized eigenvalue problem. On ToFU forget10, the evaluation reports forget-set performance approaching the retrained reference together with a model utility of 74.90 (about 15% above the second-best baseline). On WMDP, it reports hazardous knowledge accuracy of 25.68 with MMLU accuracy of 52.67. Across both benchmarks the paper claims a better unlearning-utility trade-off than eleven white-box and gray-box baselines.

Significance. If the central claim holds—that the Fisher-based discriminative basis reliably isolates target-specific signals even under high prompt similarity—this would address a practical gap in black-box unlearning for API-accessed LLMs on edge devices. The open code repository is a clear strength for reproducibility.

major comments (2)
  1. [Method (discriminative basis construction)] Method section on discriminative basis: the construction via empirical Fisher matrices from API outputs followed by the regularized generalized eigenvalue problem supplies no derivation, bound, or orthogonality argument showing that the leading directions are dominated by target-specific components rather than shared prompt structures. This is load-bearing for the discrimination accuracy claim under the high-similarity regime of ToFU.
  2. [Experiments (ToFU and WMDP evaluations)] Experimental results (ToFU forget10 and WMDP tables): the headline numbers (utility 74.90; hazardous accuracy 25.68) rest on the unverified conversion of divergence scores and the eigenvalue routing step; without details on how the regularized problem is solved or any post-hoc parameter choices, internal consistency of the reported gains cannot be assessed.
minor comments (2)
  1. [Abstract] Abstract: the acronym ToFU is used without expansion or reference on first appearance.
  2. [Experiments] The list of eleven baselines would benefit from an explicit table or appendix entry for each method and its access type (white/gray-box).

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments, which highlight important areas for strengthening the theoretical and implementation details of CBD. We address each major comment below and will revise the manuscript to incorporate the requested clarifications and derivations.

Point-by-point responses
  1. Referee: Method section on discriminative basis: the construction via empirical Fisher matrices from API outputs followed by the regularized generalized eigenvalue problem supplies no derivation, bound, or orthogonality argument showing that the leading directions are dominated by target-specific components rather than shared prompt structures. This is load-bearing for the discrimination accuracy claim under the high-similarity regime of ToFU.

    Authors: We agree that the manuscript would benefit from a formal derivation. In the revision, we will add an appendix deriving the regularized generalized eigenvalue problem, including a decomposition of the Fisher matrices into target-specific and shared components, along with an orthogonality argument showing that the leading directions prioritize target signals under high prompt similarity. This will directly support the discrimination claims for ToFU. revision: yes

  2. Referee: Experimental results (ToFU forget10 and WMDP tables): the headline numbers (utility 74.90; hazardous accuracy 25.68) rest on the unverified conversion of divergence scores and the eigenvalue routing step; without details on how the regularized problem is solved or any post-hoc parameter choices, internal consistency of the reported gains cannot be assessed.

    Authors: We concur that additional implementation details are necessary for assessing reproducibility. The revised manuscript will include: (i) the specific solver and algorithm used for the regularized generalized eigenvalue problem, (ii) the procedure for selecting the regularization parameter, (iii) pseudocode for converting divergence scores to routing decisions, and (iv) any post-hoc parameter choices. These additions will allow verification of the reported results. revision: yes
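Item (iii) in the response above, converting divergence scores into routing decisions, comes down to picking a cut-off on the relevance score. The sketch below shows one simple way to do that from labeled validation scores, under assumptions: the scores are made-up numbers, `pick_threshold` is a hypothetical helper, and capping the fraction of retained prompts that get routed away is only one of several reasonable selection rules (the paper's actual rule is not given here).

```python
def pick_threshold(forget_scores, retain_scores, max_false_route=0.05):
    """Return (threshold, forget_catch_rate, retain_false_route_rate).

    Scans candidate thresholds in ascending order and returns the first one
    whose false-routing rate on retained prompts fits the budget. Because the
    false-routing rate only falls as the threshold rises, this is also the
    candidate that catches the most forget prompts within the budget.
    """
    candidates = sorted(set(forget_scores) | set(retain_scores))
    for t in candidates:
        false_route = sum(s >= t for s in retain_scores) / len(retain_scores)
        if false_route <= max_false_route:
            caught = sum(s >= t for s in forget_scores) / len(forget_scores)
            return t, caught, false_route
    # No candidate fits the budget: route nothing away from the target model.
    return float("inf"), 0.0, 0.0


# Made-up relevance scores for a small validation split.
forget_scores = [2.8, 2.1, 1.9, 0.4]
retain_scores = [0.05, 0.1, 0.3, 0.5, 0.2]

threshold, caught, false_route = pick_threshold(forget_scores, retain_scores)
print(threshold, caught, false_route)
```

With these toy numbers one forget prompt scores below a retained prompt, so no threshold separates the two sets perfectly; the trade-off between catch rate and false routing is the same one an ROC curve summarizes.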

Circularity Check

0 steps flagged

No circularity detected in claimed method or results

full rationale

The paper describes an empirical framework (CBD) that estimates Fisher matrices from API outputs and solves a regularized generalized eigenvalue problem to build a discriminative basis. No equations, derivations, or performance claims are shown to reduce by construction to fitted inputs, self-citations, or renamed known results. Reported metrics (e.g., ToFU utility 74.90, WMDP hazardous accuracy 25.68) are presented as experimental outcomes versus baselines, with no load-bearing self-citation chains or self-definitional steps in the provided text.

Axiom & Free-Parameter Ledger

1 free parameter · 1 axiom · 0 invented entities

Review performed on abstract only; ledger entries are inferred from the high-level method description.

free parameters (1)
  • regularization parameter in generalized eigenvalue problem
    Mentioned as part of the discriminative basis construction; value not given in abstract.
axioms (1)
  • domain assumption: Empirical Fisher matrices estimated from API-accessible outputs approximate the information needed to isolate target-specific gradients.
    Invoked to construct the discriminative basis when prompt structures overlap.
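Why the regularization parameter counts as a genuine free parameter can be seen in a two-dimensional toy case. This is an illustrative assumption, not a result from the paper: both Fisher estimates are taken to be diagonal in the same basis, with forget-side energy $s$ on a shared direction and $t$ on a target-specific direction, and retain-side energy $r$ on the shared direction only.

```latex
\hat{F}_f = \begin{pmatrix} s & 0 \\ 0 & t \end{pmatrix},
\qquad
\hat{F}_r = \begin{pmatrix} r & 0 \\ 0 & 0 \end{pmatrix},
\qquad s > t > 0,\; r > 0

\lambda_{\text{shared}} = \frac{s}{r + \epsilon},
\qquad
\lambda_{\text{specific}} = \frac{t}{\epsilon}

\lambda_{\text{specific}} > \lambda_{\text{shared}}
\iff t\,(r + \epsilon) > s\,\epsilon
\iff \epsilon < \frac{t\, r}{s - t}
```

For example, with $s = r = 1$ and $t = 0.49$ the target-specific direction leads only when $\epsilon < 0.49 / 0.51 \approx 0.96$. In this toy setting a regularizer that is too large makes the problem behave like an ordinary eigen-decomposition of the forget-side matrix, which favors the shared direction, so the reported results plausibly depend on how this value was chosen.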


