pith. machine review for the scientific record.

arxiv: 2605.08730 · v1 · submitted 2026-05-09 · 💻 cs.LG · cs.CR

Recognition: no theorem link

Classification-Head Bias in Class-Level Machine Unlearning: Diagnosis, Mitigation, and Evaluation

Authors on Pith: no claims yet

Pith reviewed 2026-05-12 04:02 UTC · model grok-4.3

classification 💻 cs.LG cs.CR
keywords class-level machine unlearning · classification head bias · bias mitigation · unlearning evaluation · softmax cross-entropy · forget set accuracy

The pith

The prediction of forgotten classes can be suppressed by decreasing bias terms in the final classification head.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Class-level machine unlearning removes specified classes from a trained model while retaining utility on other classes. The paper reveals that this forgetting often occurs through a shortcut in the output layer rather than deeper changes. Under standard retain-set training with softmax cross-entropy, gradient dynamics naturally lower the bias values for classes absent from the data, which suppresses their predictions. The authors introduce BiasShift to show how simple bias adjustment can pass common unlearning metrics yet leave detectable abnormal patterns. They then propose two bias-control methods and three new metrics to quantify and reduce this dependence on head biases.

Core claim

Retain-set-only optimization tends to reduce the biases of absent classes because of gradient flow under softmax cross-entropy, so the prediction of forgotten classes can be suppressed simply by decreasing the corresponding bias terms in the classification head. BiasShift demonstrates this shortcut; TS-BGRM and LB-HR mitigate it; and BSC, MBG, and MBS track the resulting bias stability.
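Stated compactly (notation assumed from standard softmax cross-entropy; the paper's symbols may differ): with logits $z_c = w_c^\top f(x) + b_c$, the bias gradient of the per-sample loss $L = -\log p_y$ is

```latex
\frac{\partial L}{\partial b_c} = p_c - y_c,
\qquad
p_c = \frac{e^{z_c}}{\sum_{j} e^{z_j}},
```

so for a class $c$ absent from the retain set, $y_c = 0$ on every sample and the update $b_c \leftarrow b_c - \eta\, p_c$ is strictly downward whenever $p_c > 0$: retain-set-only training can only push an absent class's bias lower.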

What carries the argument

Bias terms in the final classification head, whose reduction under retain-set optimization suppresses predictions for absent classes via analyzed softmax cross-entropy gradient dynamics.

Load-bearing premise

Retain-set-only optimization tends to reduce the biases of absent classes due to the analyzed gradient dynamics under softmax cross-entropy.
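This premise is checkable directly from the loss. A minimal numpy sketch (a synthetic head, not the paper's released code) verifies that the cross-entropy bias gradient equals p − onehot(y), so a class that never appears as the label always receives a positive gradient and its bias can only fall under gradient descent:

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
num_classes, dim = 5, 8
W = rng.normal(size=(num_classes, dim))
b = np.zeros(num_classes)
x = rng.normal(size=dim)
y = 2  # a retained class; class 4 plays the "absent" (forgotten) class

# Analytic gradient of L = -log p_y with respect to the biases: p - onehot(y)
p = softmax(W @ x + b)
grad_b = p - np.eye(num_classes)[y]

# Numerical check by central differences
eps = 1e-6
num_grad = np.zeros(num_classes)
for c in range(num_classes):
    bp, bm = b.copy(), b.copy()
    bp[c] += eps
    bm[c] -= eps
    lp = -np.log(softmax(W @ x + bp)[y])
    lm = -np.log(softmax(W @ x + bm)[y])
    num_grad[c] = (lp - lm) / (2 * eps)

# The gradient for any class c != y equals p_c > 0, so plain gradient
# descent on retain-set samples strictly decreases that class's bias.
assert np.allclose(grad_b, num_grad, atol=1e-5)
assert grad_b[4] > 0
```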

What would settle it

Measure the change in classification-head bias values for absent classes after retain-set-only training on a pre-trained model and check whether they decrease substantially.
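That experiment can be simulated in a few lines. The sketch below (synthetic data and a frozen feature space, standing in for the paper's pre-trained models) fine-tunes only the head biases on a retain set that excludes class 4 and confirms the absent class's bias drops well below its starting value:

```python
import numpy as np

def softmax_rows(Z):
    Z = Z - Z.max(axis=1, keepdims=True)
    E = np.exp(Z)
    return E / E.sum(axis=1, keepdims=True)

rng = np.random.default_rng(1)
num_classes, dim, n = 5, 16, 400
forget = 4  # class absent from the retain set

# Stand-in for a pre-trained model: fixed features, small random head, zero biases
W = rng.normal(scale=0.1, size=(num_classes, dim))
b = np.zeros(num_classes)
X = rng.normal(size=(n, dim))
y = rng.integers(0, num_classes - 1, size=n)  # labels 0..3 only

# Retain-set-only fine-tuning of the biases (head weights held fixed for clarity)
lr = 0.5
for _ in range(200):
    P = softmax_rows(X @ W.T + b)
    grad_b = (P - np.eye(num_classes)[y]).mean(axis=0)  # mean of (p - y)
    b -= lr * grad_b

# The forgotten class's bias has fallen far below its initial value of 0,
# while every retained class's bias sits above it.
assert b[forget] < -1.0
assert all(b[c] > b[forget] for c in range(num_classes - 1))
```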

Figures

Figures reproduced from arXiv: 2605.08730 by Kongyang Chen, Weidong Zheng, Yatie Xiao, Yuanwei Guo.

Figure 1. Effect of manually shifting the bias of the 5th …
Figure 2. Magnitude of the bias terms after applying the …
Figure 4. Overview of the proposed framework. BiasShift exposes a bias-dominated shortcut in class-level unlearning, while …
Figure 5. Illustration of the Two-Stage Bias Gradient Reversal …
Figure 6. Classification-head bias analysis on CIFAR-10 after class-level forgetting. Subfigures (a)–(b) show the three-class …
Figure 7. Classification-head bias analysis on CIFAR-100 after class-level forgetting. For clarity, only the first 10 class heads …
Figure 8. Classification-head bias analysis on Tiny-ImageNet after class-level forgetting. For clarity, only the first 10 class …
Original abstract

Class-level machine unlearning aims to remove the influence of specified classes while preserving model utility on retained classes. Existing methods are commonly evaluated by retain-set accuracy, forget-set accuracy, and unlearning time, but these metrics provide limited insight into how forgetting is achieved internally. In this paper, we reveal a bias-dominated shortcut in class-level unlearning: the prediction of forgotten classes can be suppressed by decreasing the corresponding bias terms in the final classification head. We first analyze the gradient dynamics of classification-head biases under softmax cross-entropy training, explaining why retain-set-only optimization tends to reduce the biases of absent classes. Based on this observation, we introduce BiasShift as a diagnostic baseline, showing that simple bias manipulation can satisfy conventional unlearning metrics while leaving abnormal bias patterns that reveal forgotten labels. To mitigate excessive forgotten-class bias suppression, we propose two bias-aware mechanisms, namely Two-Stage Bias Gradient Reversal Mechanism (TS-BGRM) and Lower-Bound Hinge Regularization (LB-HR). We further introduce three bias-oriented metrics, including Bias Stability Coefficient (BSC), Median Bias Gap (MBG), and Minimal Bias Score (MBS), to quantify bias dependence and potential leakage. Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet demonstrate that the proposed methods maintain competitive unlearning performance while producing more stable bias distributions. We have released our code at {https://github.com/zwd2024/Beyond-the-Shadow-of-Bias-From-Classification-Head-Bias-to-Parameter-Redistribution}.
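BiasShift, per the abstract, suppresses forgotten-class predictions through bias manipulation alone. A minimal sketch of that idea (synthetic separable data and an assumed single-shift rule; the paper's actual procedure may differ) shows how it can satisfy the conventional retain/forget accuracy metrics without touching the feature extractor or head weights:

```python
import numpy as np

rng = np.random.default_rng(2)
num_classes, dim, per_class = 5, 16, 40
forget = 4

# Well-separated synthetic classes; an oracle head aligned with the class
# means stands in for a trained classifier.
means = rng.normal(scale=4.0, size=(num_classes, dim))
X = np.vstack([means[c] + rng.normal(size=(per_class, dim))
               for c in range(num_classes)])
y = np.repeat(np.arange(num_classes), per_class)
W, b = means.copy(), np.zeros(num_classes)

def accuracy(b_vec, mask):
    pred = (X[mask] @ W.T + b_vec).argmax(axis=1)
    return float((pred == y[mask]).mean())

before_forget = accuracy(b, y == forget)
before_retain = accuracy(b, y != forget)

# BiasShift-style edit: one large negative shift on the forgotten class's bias
b_shift = b.copy()
b_shift[forget] -= 1e3

after_forget = accuracy(b_shift, y == forget)
after_retain = accuracy(b_shift, y != forget)

# Forget-set accuracy collapses to zero while retain-set accuracy is
# untouched, so conventional metrics would score this as successful
# unlearning, even though the shifted bias itself reveals the forgotten label.
assert after_forget == 0.0
assert after_retain >= before_retain
```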

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

0 major / 4 minor

Summary. The paper claims that class-level machine unlearning frequently exploits a bias-dominated shortcut: predictions for forgotten classes are suppressed simply by decreasing the corresponding bias terms in the final classification head. It derives this from the gradient dynamics of softmax cross-entropy, where the bias update for an absent class equals p_c (strictly positive under retain-set-only optimization), independent of feature-extractor changes. The authors introduce BiasShift as a diagnostic baseline that satisfies standard retain/forget accuracy metrics via bias manipulation alone while exposing abnormal bias patterns, propose TS-BGRM and LB-HR to limit excessive bias suppression, and define three new bias-oriented metrics (BSC, MBG, MBS). Experiments on CIFAR-10, CIFAR-100, and Tiny-ImageNet show competitive unlearning performance with more stable bias distributions; code is released.

Significance. If the central observation holds, the work is significant for exposing a mechanistic shortcut that conventional metrics overlook, thereby motivating more robust evaluation and mitigation in unlearning. Strengths include the direct, parameter-free gradient analysis (bias gradient = p_c - y_c) that explains the phenomenon without circular fitting, the empirical demonstration that BiasShift alone can satisfy existing metrics, the introduction of bias-aware mitigations and metrics, and the public code release for reproducibility. This could shift unlearning research toward internal-mechanism diagnostics rather than surface-level accuracy checks.

minor comments (4)
  1. The definitions and exact computation of the new metrics BSC, MBG, and MBS (introduced after the mitigation methods) should include explicit formulas or pseudocode in the main text or appendix to ensure immediate reproducibility.
  2. The experimental section would benefit from reporting standard deviations or confidence intervals across multiple random seeds for both accuracy and the proposed bias metrics, as single-run results limit assessment of stability claims.
  3. The hyperparameter choices for the free parameters in TS-BGRM (stage thresholds, reversal strengths) and LB-HR (hinge lower-bound) are listed but their sensitivity analysis or selection procedure could be expanded for clarity.
  4. Figure captions and axis labels for bias-distribution plots should explicitly reference the new metrics (BSC/MBG/MBS) to connect visuals directly to the quantitative claims.

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for the positive evaluation of our work, the recognition of its significance in exposing the bias-dominated shortcut in class-level unlearning, and the recommendation for minor revision. We appreciate the acknowledgment of the gradient analysis, BiasShift diagnostic, proposed mitigations, new metrics, and code release.

Circularity Check

0 steps flagged

No significant circularity; derivation is self-contained

full rationale

The paper's central analysis derives the bias gradient as p_c - y_c under softmax cross-entropy and shows that retain-set-only optimization reduces absent-class biases because the update is independent of features when y_c=0. This follows directly from standard loss mathematics with no fitting to target results or self-referential definitions. BiasShift is introduced as an explicit diagnostic baseline that satisfies conventional metrics via bias manipulation alone, exposing the shortcut rather than predicting it. The proposed TS-BGRM, LB-HR, and new metrics (BSC, MBG, MBS) are defined independently to quantify and mitigate bias dependence without reducing to any fitted quantities or prior self-citations. No load-bearing step collapses to an input by construction, and the argument remains externally falsifiable via the released code and standard benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central claim rests on standard assumptions of gradient flow in classification heads plus a small number of tunable hyperparameters in the proposed mechanisms; no new physical entities or ad-hoc constants are introduced.

free parameters (2)
  • stage thresholds and reversal strengths in TS-BGRM
    Tunable parameters controlling when and how strongly bias gradients are reversed during the two-stage process.
  • hinge lower-bound value in LB-HR
    Chosen threshold that prevents excessive negative bias shifts for forgotten classes.
axioms (1)
  • domain assumption: gradient dynamics of classification-head biases under softmax cross-entropy cause retain-set-only optimization to reduce the biases of absent classes
    Invoked to explain the observed shortcut; treated as a derived property of the standard loss rather than proved from first principles in the abstract.
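The abstract names the LB-HR lower bound but not its formula; one plausible reading (an assumption for illustration, not the paper's definition) is a hinge penalty that fires only when a forgotten-class bias drops below a floor τ:

```python
import numpy as np

def lb_hr_penalty(biases, forget_idx, tau=-0.5, lam=1.0):
    """Hinge lower-bound penalty (hypothetical form): charges lam per unit
    that a forgotten-class bias falls below the floor tau, and is zero
    while the bias stays at or above it."""
    b_f = np.asarray(biases)[forget_idx]
    return float(lam * np.maximum(0.0, tau - b_f).sum())

b = np.array([0.1, -0.2, 0.05, -1.5, 0.0])
lb_hr_penalty(b, forget_idx=[3])  # → 1.0: bias sits 1.0 below the floor
lb_hr_penalty(b, forget_idx=[0])  # → 0.0: bias is above the floor
```

Under this form, τ is exactly the chosen free parameter: it caps how far unlearning may push a forgotten-class bias down before the regularizer resists.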

pith-pipeline@v0.9.0 · 5590 in / 1225 out tokens · 41199 ms · 2026-05-12T04:02:20.333429+00:00 · methodology

