An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations
Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3
The pith
Many machine unlearning methods appear to succeed largely because the classifier becomes misaligned with the model's learned features, not because those features have forgotten anything.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Many state-of-the-art machine unlearning methods appear successful mainly due to a misalignment between last-layer features and the classifier. Hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, as demonstrated through classifier-only fine-tuning experiments. New methods based on a class-mean features classifier explicitly enforce alignment and reduce forgotten information in representations.
What carries the argument
Feature-classifier misalignment, where last-layer features stay discriminative for forgotten data even as the classifier is adjusted to output low accuracy on them; the class-mean features (CMF) classifier enforces alignment between features and decision boundaries.
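As a rough illustration of the idea, a class-mean features classifier can be sketched as a nearest-class-mean rule on last-layer features. This is a minimal sketch and not the paper's implementation: the function names `class_means` and `cmf_predict`, the toy 2-D features, and the choice of squared Euclidean distance are all assumptions made for illustration.

```python
from collections import defaultdict

def class_means(features, labels):
    """Average the feature vectors of each class (hypothetical helper)."""
    sums = {}
    counts = defaultdict(int)
    for x, y in zip(features, labels):
        if y not in sums:
            sums[y] = [0.0] * len(x)
        sums[y] = [s + v for s, v in zip(sums[y], x)]
        counts[y] += 1
    return {y: [s / counts[y] for s in sums[y]] for y in sums}

def cmf_predict(x, means):
    """Predict the class whose mean feature is closest to x.

    Assumption: squared Euclidean distance; the paper may define the
    decision rule differently (e.g. via inner products).
    """
    def sq_dist(a, b):
        return sum((ai - bi) ** 2 for ai, bi in zip(a, b))
    return min(means, key=lambda y: sq_dist(x, means[y]))

# Toy last-layer features for two classes (illustrative values only).
features = [[1.0, 0.1], [0.9, -0.1], [-0.1, 1.0], [0.1, 0.9]]
labels = [0, 0, 1, 1]

means = class_means(features, labels)
pred = cmf_predict([0.8, 0.2], means)
```

In this sketch the decision rule is computed from the features themselves, so the classifier cannot drift away from them the way a separately trained output head can.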
If this is right
- Linear probing recovers near-original accuracy on forgotten data in unlearned models.
- Classifier-only fine-tuning can drive forget accuracy to negligible levels while preserving retain accuracy, without modifying the feature extractor.
- CMF-based unlearning maintains high retain accuracy while reducing information in hidden features about forgotten classes.
- Evaluation of unlearning must include representation-level checks beyond output behavior alone.
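The classifier-only point in the list above can be made concrete with a toy example. The sketch below keeps the features frozen and changes only the output head. It assumes a linear head and uses a large negative bias on the forget class as a hypothetical stand-in for classifier-only fine-tuning; the paper's actual procedure is not reproduced here, and all values are illustrative.

```python
def predict(x, weights, biases):
    """Arg-max prediction of a linear head on a single feature vector."""
    logits = [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b
              for w, b in zip(weights, biases)]
    return max(range(len(logits)), key=lambda c: logits[c])

def accuracy(features, labels, weights, biases):
    correct = sum(predict(x, weights, biases) == y
                  for x, y in zip(features, labels))
    return correct / len(labels)

# Frozen toy features; class 0 plays the role of the forget class.
forget_x = [[1.0, 0.0, 0.0], [0.9, 0.1, 0.0]]
forget_y = [0, 0]
retain_x = [[0.0, 1.0, 0.0], [0.1, 0.9, 0.0],
            [0.0, 0.0, 1.0], [0.0, 0.1, 0.9]]
retain_y = [1, 1, 2, 2]

# Original head: one weight row per class, zero biases.
weights = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
biases = [0.0, 0.0, 0.0]

before_forget = accuracy(forget_x, forget_y, weights, biases)
before_retain = accuracy(retain_x, retain_y, weights, biases)

# "Unlearn" by touching only the head: suppress the forget-class logit.
unlearned_biases = [-10.0, 0.0, 0.0]
after_forget = accuracy(forget_x, forget_y, weights, unlearned_biases)
after_retain = accuracy(retain_x, retain_y, weights, unlearned_biases)
```

Output metrics alone would report successful forgetting here, yet `forget_x` is unchanged and remains as separable as before, which is why a representation-level check is needed.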
Where Pith is reading between the lines
- True machine unlearning likely requires changes to the feature extraction layers rather than just the output head.
- This finding highlights potential vulnerabilities in relying solely on output metrics for assessing privacy or forgetting.
- Extending the analysis to models without neural collapse could test whether misalignment is the dominant factor more generally.
- Unlearning techniques might need to directly degrade the discriminative power of internal representations.
Load-bearing premise
The original model exhibits neural collapse so that features form tight clusters around class means and the classifier can be adjusted independently to control forgetting.
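Whether a model is close to neural collapse can be checked numerically. The sketch below computes two simplified proxies on toy features: the ratio of within-class to between-class scatter, which should approach zero, and the pairwise cosines between globally centered class means, which equal -1/(K-1) for K classes arranged as a simplex equiangular tight frame. The function names and the trace-ratio simplification are assumptions for illustration, not the metrics reported in the paper.

```python
import math

def mean_vector(vectors):
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]

def within_between_ratio(features, labels):
    """Total within-class scatter divided by total between-class scatter."""
    global_mean = mean_vector(features)
    within = 0.0
    between = 0.0
    for c in sorted(set(labels)):
        members = [x for x, y in zip(features, labels) if y == c]
        mu = mean_vector(members)
        within += sum(sum((xi - mi) ** 2 for xi, mi in zip(x, mu))
                      for x in members)
        between += len(members) * sum((mi - gi) ** 2
                                      for mi, gi in zip(mu, global_mean))
    return within / between

def centered_mean_cosines(features, labels):
    """Pairwise cosines between globally centered class means."""
    global_mean = mean_vector(features)
    centered = []
    for c in sorted(set(labels)):
        mu = mean_vector([x for x, y in zip(features, labels) if y == c])
        centered.append([m - g for m, g in zip(mu, global_mean)])
    cosines = []
    for i in range(len(centered)):
        for j in range(i + 1, len(centered)):
            dot = sum(a * b for a, b in zip(centered[i], centered[j]))
            norm_i = math.sqrt(sum(a * a for a in centered[i]))
            norm_j = math.sqrt(sum(b * b for b in centered[j]))
            cosines.append(dot / (norm_i * norm_j))
    return cosines

labels = [0, 0, 1, 1, 2, 2]
# Fully collapsed toy features: every sample sits on its class mean.
collapsed = [[1, 0, 0], [1, 0, 0], [0, 1, 0], [0, 1, 0], [0, 0, 1], [0, 0, 1]]
# Same class means, but with some within-class spread.
spread = [[1.1, 0, 0], [0.9, 0, 0], [0, 1.1, 0], [0, 0.9, 0],
          [0, 0, 1.1], [0, 0, 0.9]]

ratio_collapsed = within_between_ratio(collapsed, labels)
ratio_spread = within_between_ratio(spread, labels)
cosines = centered_mean_cosines(collapsed, labels)
```

With three classes, the collapsed example gives a ratio of zero and cosines of -1/2; larger ratios or uneven cosines would suggest that the premise holds only approximately.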
What would settle it
An experiment showing that linear probing on the last-layer features of an unlearned model yields only low accuracy on forgotten data would indicate that the features have actually been altered beyond mere misalignment.
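A linear probe of this kind can be sketched as softmax regression trained on frozen features. The example below is a minimal, pure-Python sketch under stated assumptions: the toy 2-D vectors stand in for last-layer features of an unlearned model, class 0 plays the forget class, the probe is trained on labeled features from all classes, and the names, learning rate, and epoch count are illustrative choices. The source does not specify the probing protocol.

```python
import math

def softmax(logits):
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

def logits_for(x, W, b):
    return [sum(w_j * x_j for w_j, x_j in zip(w, x)) + b_c
            for w, b_c in zip(W, b)]

def train_linear_probe(features, labels, num_classes, lr=0.5, epochs=200):
    """Full-batch gradient descent on the softmax cross-entropy loss."""
    dim = len(features[0])
    n = len(features)
    W = [[0.0] * dim for _ in range(num_classes)]
    b = [0.0] * num_classes
    for _ in range(epochs):
        grad_W = [[0.0] * dim for _ in range(num_classes)]
        grad_b = [0.0] * num_classes
        for x, y in zip(features, labels):
            probs = softmax(logits_for(x, W, b))
            for c in range(num_classes):
                err = probs[c] - (1.0 if c == y else 0.0)
                grad_b[c] += err / n
                for j in range(dim):
                    grad_W[c][j] += err * x[j] / n
        for c in range(num_classes):
            b[c] -= lr * grad_b[c]
            for j in range(dim):
                W[c][j] -= lr * grad_W[c][j]
    return W, b

def probe_accuracy(features, labels, W, b):
    preds = [max(range(len(b)), key=lambda c: logits_for(x, W, b)[c])
             for x in features]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)

# Frozen toy features used to fit the probe (illustrative values only).
train_x = [[1.0, 0.0], [0.9, 0.1], [1.1, -0.1],
           [0.0, 1.0], [0.1, 0.9], [-0.1, 1.1],
           [-1.0, -1.0], [-0.9, -1.1], [-1.1, -0.9]]
train_y = [0, 0, 0, 1, 1, 1, 2, 2, 2]

# Held-out features from the forget class (class 0).
held_out_forget_x = [[0.95, 0.05], [1.05, -0.05]]
held_out_forget_y = [0, 0]

W, b = train_linear_probe(train_x, train_y, num_classes=3)
forget_probe_acc = probe_accuracy(held_out_forget_x, held_out_forget_y, W, b)
```

High probe accuracy on the forget class would support the misalignment reading; low probe accuracy would be the settling result described above. A real experiment would also need the split, seed, and error-bar details that the referee report asks for.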
Original abstract
While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable-for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly due to a misalignment between last-layer features and the classifier, a phenomenon we call feature-classifier misalignment. In fact, hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, and we corroborate this with experiments using classifier-only fine-tuning. Motivated by these findings, we propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces forgotten information in representations while maintaining high retain accuracy, highlighting the need for faithful representation-level evaluation of MU.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that many state-of-the-art machine unlearning (MU) methods succeed primarily due to a misalignment between last-layer features and the classifier (termed feature-classifier misalignment), with hidden features remaining highly discriminative as evidenced by linear probing recovering near-original accuracy. Assuming neural collapse in the original model, it further shows that adjusting only the classifier suffices to achieve negligible forget accuracy while preserving retain accuracy, corroborated via classifier-only fine-tuning experiments. Motivated by this, the authors propose MU methods based on a class-mean features (CMF) classifier that enforces alignment, with experiments on standard benchmarks demonstrating reduced forgotten information in representations alongside high retain accuracy.
Significance. If the misalignment observation and CMF results hold, the work is significant for shifting MU evaluation toward representation-level analysis and exposing potential over-reliance on output metrics in prior methods. It offers a mechanistic explanation for why simple fine-tuning can reintroduce forgotten concepts and introduces a new class of alignment-enforcing unlearning approaches. Credit is given for the empirical support via linear probing and classifier-only fine-tuning on standard benchmarks, as well as for highlighting the need for faithful internal-representation evaluation of MU.
major comments (2)
- [Abstract / classifier-adjustment demonstration] The demonstration that adjusting only the classifier achieves negligible forget accuracy while preserving retain accuracy (Abstract) explicitly assumes neural collapse in the original model. This assumption is load-bearing for interpreting the classifier-only fine-tuning experiment as corroboration of the misalignment mechanism, yet the manuscript does not verify whether the backbone models (e.g., ResNet/VGG on CIFAR-10/100) exhibit class-mean collapse to simplex vertices and vanishing within-class variability under the training regimes used. If collapse does not hold, the theoretical justification weakens and the result risks becoming an ad-hoc observation rather than evidence for the proposed mechanism.
- [Linear probing experiments] The central claim that hidden features remain highly discriminative (supported by linear probing recovering near-original accuracy) is load-bearing for the misalignment thesis, but the manuscript provides insufficient detail on the probing protocol: whether probes are trained on the same data splits as the original model or on independent benchmarks, the number of runs and error bars, exact recovered accuracies versus the original model, and any data-exclusion rules. Without these, the strength of the 'near-original' recovery and its contrast to MU outputs cannot be fully assessed.
minor comments (1)
- [Abstract] The abstract contains a minor typographical issue: 'vulnerable-for example' lacks a space after the hyphen.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback and positive assessment of the significance of our work on feature-classifier misalignment in machine unlearning. We address each major comment point by point below, with plans to revise the manuscript accordingly.
Point-by-point responses
- Referee: [Abstract / classifier-adjustment demonstration] The demonstration that adjusting only the classifier achieves negligible forget accuracy while preserving retain accuracy (Abstract) explicitly assumes neural collapse in the original model. This assumption is load-bearing for interpreting the classifier-only fine-tuning experiment as corroboration of the misalignment mechanism, yet the manuscript does not verify whether the backbone models (e.g., ResNet/VGG on CIFAR-10/100) exhibit class-mean collapse to simplex vertices and vanishing within-class variability under the training regimes used. If collapse does not hold, the theoretical justification weakens and the result risks becoming an ad-hoc observation rather than evidence for the proposed mechanism.
  Authors: We appreciate the referee pointing out the importance of verifying the neural collapse assumption. The theoretical motivation for the classifier-adjustment demonstration does invoke neural collapse to explain why misalignment can occur, but the classifier-only fine-tuning experiment is presented as an independent empirical corroboration that does not strictly require perfect collapse. To strengthen the mechanistic link and address the concern directly, the revised manuscript will include a new analysis section verifying the degree of neural collapse in our original models. This will report metrics such as the alignment of class-mean features to simplex vertices and within-class variability for ResNet and VGG on CIFAR-10/100 under the exact training regimes used.
  Revision: yes
- Referee: [Linear probing experiments] The central claim that hidden features remain highly discriminative (supported by linear probing recovering near-original accuracy) is load-bearing for the misalignment thesis, but the manuscript provides insufficient detail on the probing protocol: whether probes are trained on the same data splits as the original model or on independent benchmarks, the number of runs and error bars, exact recovered accuracies versus the original model, and any data-exclusion rules. Without these, the strength of the 'near-original' recovery and its contrast to MU outputs cannot be fully assessed.
  Authors: We agree that additional details on the linear probing protocol are necessary to fully substantiate the central claim. In the revised manuscript, we will expand the experimental section to specify: that probes are trained on the same data splits as the original model (with explicit train/validation/test divisions and no overlap with forget sets), the number of independent runs (five runs with different random seeds for initialization), standard deviations or error bars on all reported accuracies, the precise recovered accuracies relative to the original model's performance on both retain and forget classes, and confirmation that probing excludes any forget-set data to prevent leakage. These clarifications will allow readers to better evaluate the strength of the near-original recovery results.
  Revision: yes
Circularity Check
No circularity; empirical probing and external assumption keep derivation self-contained
The paper's central analysis of feature-classifier misalignment rests on linear probing experiments that directly measure retained discriminability in hidden layers on independent benchmarks, without any fitted parameter being relabeled as a prediction. The neural collapse assumption is invoked explicitly from prior literature to motivate the classifier-only adjustment demonstration and is corroborated by separate fine-tuning experiments rather than derived from the paper's own results or self-citations. No equation or claim reduces to its inputs by construction, and the proposed CMF classifier is a new method rather than a renaming or self-referential fit.
Axiom & Free-Parameter Ledger
axioms (1)
- Domain assumption: neural collapse in the original model
invented entities (1)
- class-mean features (CMF) classifier: no independent evidence
Forward citations
Cited by 1 Pith paper
- Classification-Head Bias in Class-Level Machine Unlearning: Diagnosis, Mitigation, and Evaluation
  Class-level unlearning can take a shortcut through bias suppression in the classification head; the paper introduces bias-aware training mechanisms and bias-specific metrics to diagnose and reduce this dependence.