arxiv: 2604.08271 · v1 · submitted 2026-04-09 · 💻 cs.LG


An Illusion of Unlearning? Assessing Machine Unlearning Through Internal Representations

Yichen Gao, Altay Unal, Akshay Rangamani, Zhihui Zhu


Pith reviewed 2026-05-10 17:59 UTC · model grok-4.3


keywords machine unlearning · feature-classifier misalignment · neural collapse · linear probing · internal representations · class-mean features classifier


The pith

Machine unlearning methods succeed largely by misaligning the classifier from the model's learned features.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper investigates why many machine unlearning techniques seem effective at the output level but may not truly erase information from the model. It reveals that success often stems from a mismatch between the internal features and the classifier, leaving the features themselves highly informative about the forgotten data. Simple methods like linear probing can recover high accuracy on forgotten classes from these representations. The authors further show that, assuming neural collapse, merely adjusting the classifier achieves unlearning-like effects while preserving performance on retained data. This motivates new unlearning approaches that enforce alignment between features and classifiers to achieve more faithful erasure at the representation level.

Core claim

Many state-of-the-art machine unlearning methods appear successful mainly due to a misalignment between last-layer features and the classifier. Hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, as demonstrated through classifier-only fine-tuning experiments. New methods based on a class-mean features classifier explicitly enforce alignment and reduce forgotten information in representations.

What carries the argument

Feature-classifier misalignment, where last-layer features stay discriminative for forgotten data even as the classifier is adjusted to output low accuracy on them; the class-mean features (CMF) classifier enforces alignment between features and decision boundaries.
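A class-mean features head can be illustrated with a nearest-class-center rule: each class is scored by how close a feature vector lies to that class's mean feature, so the decision boundary cannot drift away from the features. The sketch below is a minimal illustration under assumptions, not the paper's implementation; the function names (`class_means`, `cmf_logits`, `cmf_predict`), the negative squared distance score, and the toy Gaussian data are all hypothetical choices made here.

```python
import numpy as np

def class_means(features, labels, num_classes):
    """Mean feature vector per class, shape (num_classes, dim)."""
    return np.stack([features[labels == c].mean(axis=0) for c in range(num_classes)])

def cmf_logits(features, means):
    """Score each class by negative squared distance to its class mean."""
    diffs = features[:, None, :] - means[None, :, :]
    return -np.sum(diffs ** 2, axis=2)

def cmf_predict(features, means):
    """Nearest-class-center (NCC) prediction: pick the closest class mean."""
    return np.argmax(cmf_logits(features, means), axis=1)

# Toy stand-in for last-layer features: three well-separated Gaussian clusters.
rng = np.random.default_rng(0)
centers = np.array([[5.0, 0.0], [0.0, 5.0], [-5.0, -5.0]])
labels = np.repeat(np.arange(3), 50)
features = centers[labels] + 0.3 * rng.standard_normal((150, 2))

means = class_means(features, labels, 3)
accuracy = float(np.mean(cmf_predict(features, means) == labels))
print(f"NCC accuracy on toy features: {accuracy:.2f}")
```

Because the head is computed from the features rather than trained separately, lowering its accuracy on a class requires moving that class's features, which is the property the argument relies on.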

If this is right

Linear probing recovers near-original accuracy on forgotten data in unlearned models.
Classifier-only fine-tuning can achieve effective unlearning without modifying the feature extractor.
CMF-based unlearning maintains high retain accuracy while reducing information in hidden features about forgotten classes.
Evaluation of unlearning must include representation-level checks beyond output behavior alone.
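The first two consequences above can be reproduced on synthetic data. The sketch below assumes tightly clustered features, suppresses one class by shifting only that class's bias in the head, and then fits a fresh linear probe on the untouched features. The bias shift is a deliberately crude adjustment chosen for illustration and is not one of the unlearning methods the paper evaluates; the least-squares probe, the variable names, and the data are likewise assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
num_classes, dim, per_class = 4, 4, 200
forget = 0  # index of the class to "forget"

# Hypothetical collapsed features: tight clusters around scaled one-hot means.
means = 4.0 * np.eye(num_classes, dim)

def sample(n):
    y = np.repeat(np.arange(num_classes), n)
    x = means[y] + 0.3 * rng.standard_normal((y.size, dim))
    return x, y

x_train, y_train = sample(per_class)
x_test, y_test = sample(per_class)

# Head aligned with the class means; the "unlearned" head changes one bias only.
W = means.copy()
b_unlearned = np.zeros(num_classes)
b_unlearned[forget] = -1e3

def accuracy(logits, y, mask):
    return float(np.mean(np.argmax(logits, axis=1)[mask] == y[mask]))

is_forget = y_test == forget
head_logits = x_test @ W.T + b_unlearned
head_forget_acc = accuracy(head_logits, y_test, is_forget)
head_retain_acc = accuracy(head_logits, y_test, ~is_forget)

# Linear probe: least-squares fit onto one-hot labels; features are unchanged.
def add_bias(a):
    return np.hstack([a, np.ones((a.shape[0], 1))])

probe, *_ = np.linalg.lstsq(add_bias(x_train), np.eye(num_classes)[y_train], rcond=None)
probe_forget_acc = accuracy(add_bias(x_test) @ probe, y_test, is_forget)

print(f"head forget acc:  {head_forget_acc:.2f}")
print(f"head retain acc:  {head_retain_acc:.2f}")
print(f"probe forget acc: {probe_forget_acc:.2f}")
```

In this toy setting the output-level forget accuracy collapses while the probe still separates the forgotten class, which is the gap a representation-level check is meant to expose.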

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

True machine unlearning likely requires changes to the feature extraction layers rather than just the output head.
This finding highlights potential vulnerabilities in relying solely on output metrics for assessing privacy or forgetting.
Extending the analysis to models without neural collapse could test whether misalignment is the dominant factor more generally.
Unlearning techniques might need to directly degrade the discriminative power of internal representations.

Load-bearing premise

The original model exhibits neural collapse so that features form tight clusters around class means and the classifier can be adjusted independently to control forgetting.
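One way to see why this premise makes classifier-only adjustment sufficient is a short calculation under idealized collapse. The construction below is a sketch written for this summary and is not claimed to be the paper's Proposition 1; the symbols $\alpha$, $\rho$, $\beta$, $b$, and the choice to shift only the forget-class bias are assumptions introduced here.

```latex
% Assumptions (idealized neural collapse, not verified for any particular model):
%   NC1: every class-c feature equals its class mean, h = \mu_c (after centering)
%   NC2: \langle \mu_c, \mu_k \rangle = \rho^2 if c = k, and -\rho^2/(C-1) otherwise
%   NC3: classifier rows w_k = \alpha \mu_k with \alpha > 0, and equal biases b_k = b
\begin{align*}
z_k(h) &= \langle w_k, h \rangle + b
  = \begin{cases} \alpha\rho^2 + b, & k = c \\ -\dfrac{\alpha\rho^2}{C-1} + b, & k \neq c \end{cases}
  && \text{(logits for a class-$c$ sample)} \\
b_f &\;\to\; b - \beta, \qquad \beta > \frac{C}{C-1}\,\alpha\rho^2
  && \text{(only the forget-class bias changes)} \\
c = f:&\quad z_f = \alpha\rho^2 + b - \beta \;<\; -\frac{\alpha\rho^2}{C-1} + b = z_k \quad (k \neq f)
  && \text{(class $f$ is never predicted)} \\
c \neq f:&\quad z_c = \alpha\rho^2 + b \;>\; z_k \quad (k \neq c)
  && \text{(retain predictions unchanged)}
\end{align*}
```

Under these assumptions the forget accuracy drops to zero and the retain accuracy is untouched, while the features $h$ never change. If collapse holds only approximately, the inequalities hold with a margin rather than exactly.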

What would settle it

An experiment showing that linear probing on the last-layer features of an unlearned model yields only low accuracy on forgotten data would indicate that the features have actually been altered beyond mere misalignment.

Figures

Figures reproduced from arXiv: 2604.08271 by Akshay Rangamani, Altay Unal, Yichen Gao, Zhihui Zhu.

**Figure 2.** Visualization of Proposition 1. We observe …

**Figure 4.** Learning curve of Random-label unlearning on CIFAR-10 when forgetting class 0 (airplane). While the output-level forget accuracy drops to zero quickly, the linear-probe and NCC accuracies remain consistently high throughout the unlearning process.

**Figure 5.** Feature-classifier alignment (NC3) for single-class forgetting on CIFAR-10: the distance between class means and classifier weights increases for the forget class while it is preserved for the retain classes.

**Figure 6.** t-SNE of features learned with CMF-based unlearning methods. The forgotten class (red points) …

Original abstract

While numerous machine unlearning (MU) methods have recently been developed with promising results in erasing the influence of forgotten data, classes, or concepts, they are also highly vulnerable-for example, simple fine-tuning can inadvertently reintroduce erased concepts. In this paper, we address this contradiction by examining the internal representations of unlearned models, in contrast to prior work that focuses primarily on output-level behavior. Our analysis shows that many state-of-the-art MU methods appear successful mainly due to a misalignment between last-layer features and the classifier, a phenomenon we call feature-classifier misalignment. In fact, hidden features remain highly discriminative, and simple linear probing can recover near-original accuracy. Assuming neural collapse in the original model, we further demonstrate that adjusting only the classifier can achieve negligible forget accuracy while preserving retain accuracy, and we corroborate this with experiments using classifier-only fine-tuning. Motivated by these findings, we propose MU methods based on a class-mean features (CMF) classifier, which explicitly enforces alignment between features and classifiers. Experiments on standard benchmarks show that CMF-based unlearning reduces forgotten information in representations while maintaining high retain accuracy, highlighting the need for faithful representation-level evaluation of MU.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The paper shows that standard unlearning often just detaches the classifier from still-informative features, and proposes a class-mean features head to fix the alignment.


The core observation is that many machine unlearning methods look successful at the output level mainly because they break the match between the last-layer features and the final classifier. The features themselves stay discriminative, so a linear probe recovers most of the original accuracy on forgotten data. That is the new angle: prior work mostly checked outputs or membership inference, while this one looks inside at representation quality.

Referee Report

2 major / 1 minor

Summary. The paper claims that many state-of-the-art machine unlearning (MU) methods succeed primarily due to a misalignment between last-layer features and the classifier (termed feature-classifier misalignment), with hidden features remaining highly discriminative as evidenced by linear probing recovering near-original accuracy. Assuming neural collapse in the original model, it further shows that adjusting only the classifier suffices to achieve negligible forget accuracy while preserving retain accuracy, corroborated via classifier-only fine-tuning experiments. Motivated by this, the authors propose MU methods based on a class-mean features (CMF) classifier that enforces alignment, with experiments on standard benchmarks demonstrating reduced forgotten information in representations alongside high retain accuracy.

Significance. If the misalignment observation and CMF results hold, the work is significant for shifting MU evaluation toward representation-level analysis and exposing potential over-reliance on output metrics in prior methods. It offers a mechanistic explanation for why simple fine-tuning can reintroduce forgotten concepts and introduces a new class of alignment-enforcing unlearning approaches. Credit is given for the empirical support via linear probing and classifier-only fine-tuning on standard benchmarks, as well as for highlighting the need for faithful internal-representation evaluation of MU.

major comments (2)

[Abstract / classifier-adjustment demonstration] The demonstration that adjusting only the classifier achieves negligible forget accuracy while preserving retain accuracy (Abstract) explicitly assumes neural collapse in the original model. This assumption is load-bearing for interpreting the classifier-only fine-tuning experiment as corroboration of the misalignment mechanism, yet the manuscript does not verify whether the backbone models (e.g., ResNet/VGG on CIFAR-10/100) exhibit class-mean collapse to simplex vertices and vanishing within-class variability under the training regimes used. If collapse does not hold, the theoretical justification weakens and the result risks becoming an ad-hoc observation rather than evidence for the proposed mechanism.
[Linear probing experiments] The central claim that hidden features remain highly discriminative (supported by linear probing recovering near-original accuracy) is load-bearing for the misalignment thesis, but the manuscript provides insufficient detail on the probing protocol: whether probes are trained on the same data splits as the original model or on independent benchmarks, the number of runs and error bars, exact recovered accuracies versus the original model, and any data-exclusion rules. Without these, the strength of the 'near-original' recovery and its contrast to MU outputs cannot be fully assessed.

minor comments (1)

[Abstract] The abstract contains a minor typographical issue: 'vulnerable-for example' lacks a space after the hyphen.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and positive assessment of the significance of our work on feature-classifier misalignment in machine unlearning. We address each major comment point by point below, with plans to revise the manuscript accordingly.


Referee: [Abstract / classifier-adjustment demonstration] The demonstration that adjusting only the classifier achieves negligible forget accuracy while preserving retain accuracy (Abstract) explicitly assumes neural collapse in the original model. This assumption is load-bearing for interpreting the classifier-only fine-tuning experiment as corroboration of the misalignment mechanism, yet the manuscript does not verify whether the backbone models (e.g., ResNet/VGG on CIFAR-10/100) exhibit class-mean collapse to simplex vertices and vanishing within-class variability under the training regimes used. If collapse does not hold, the theoretical justification weakens and the result risks becoming an ad-hoc observation rather than evidence for the proposed mechanism.

Authors: We appreciate the referee pointing out the importance of verifying the neural collapse assumption. The theoretical motivation for the classifier-adjustment demonstration does invoke neural collapse to explain why misalignment can occur, but the classifier-only fine-tuning experiment is presented as an independent empirical corroboration that does not strictly require perfect collapse. To strengthen the mechanistic link and address the concern directly, the revised manuscript will include a new analysis section verifying the degree of neural collapse in our original models. This will report metrics such as the alignment of class-mean features to simplex vertices and within-class variability for ResNet and VGG on CIFAR-10/100 under the exact training regimes used.

Revision: yes

Referee: [Linear probing experiments] The central claim that hidden features remain highly discriminative (supported by linear probing recovering near-original accuracy) is load-bearing for the misalignment thesis, but the manuscript provides insufficient detail on the probing protocol: whether probes are trained on the same data splits as the original model or on independent benchmarks, the number of runs and error bars, exact recovered accuracies versus the original model, and any data-exclusion rules. Without these, the strength of the 'near-original' recovery and its contrast to MU outputs cannot be fully assessed.

Authors: We agree that additional details on the linear probing protocol are necessary to fully substantiate the central claim. In the revised manuscript, we will expand the experimental section to specify: that probes are trained on the same data splits as the original model (with explicit train/validation/test divisions and no overlap with forget sets), the number of independent runs (five runs with different random seeds for initialization), standard deviations or error bars on all reported accuracies, the precise recovered accuracies relative to the original model's performance on both retain and forget classes, and confirmation that probing excludes any forget-set data to prevent leakage. These clarifications will allow readers to better evaluate the strength of the near-original recovery results.

Revision: yes
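The reporting side of such a protocol (disjoint splits, several seeded runs, mean and standard deviation) can be sketched as follows. This is a hypothetical illustration only: the synthetic features, the least-squares probe, the 80/20 split, and the names `probe_run` and `report` are assumptions, and the rebuttal's forget-set exclusion rule is not modeled. A least-squares probe has no random initialization, so the seed here controls only the split.

```python
import numpy as np

# Hypothetical stand-in for frozen last-layer features of an unlearned model.
data_rng = np.random.default_rng(123)
num_classes, dim, per_class = 3, 8, 100
y_all = np.repeat(np.arange(num_classes), per_class)
x_all = 3.0 * np.eye(num_classes, dim)[y_all] + data_rng.standard_normal((y_all.size, dim))

def add_bias(a):
    return np.hstack([a, np.ones((a.shape[0], 1))])

def probe_run(seed, train_frac=0.8):
    """One probing run: seeded shuffle, disjoint train/test split, linear probe."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(y_all.size)
    cut = int(train_frac * y_all.size)
    train, test = order[:cut], order[cut:]
    assert np.intersect1d(train, test).size == 0  # splits must not overlap
    targets = np.eye(num_classes)[y_all[train]]
    w, *_ = np.linalg.lstsq(add_bias(x_all[train]), targets, rcond=None)
    pred = np.argmax(add_bias(x_all[test]) @ w, axis=1)
    return float(np.mean(pred == y_all[test]))

seeds = [0, 1, 2, 3, 4]
accs = np.array([probe_run(s) for s in seeds])
report = f"{accs.mean():.3f} ± {accs.std(ddof=1):.3f} over {len(seeds)} seeds"
print(report)
```

Reporting the spread alongside the mean is what lets a reader judge whether "near-original" recovery is stable or an artifact of one favorable split.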

Circularity Check

0 steps flagged

No circularity; empirical probing and external assumption keep derivation self-contained


The paper's central analysis of feature-classifier misalignment rests on linear probing experiments that directly measure retained discriminability in hidden layers on independent benchmarks, without any fitted parameter being relabeled as a prediction. The neural collapse assumption is invoked explicitly from prior literature to motivate the classifier-only adjustment demonstration and is corroborated by separate fine-tuning experiments rather than derived from the paper's own results or self-citations. No equation or claim reduces to its inputs by construction, and the proposed CMF classifier is a new method rather than a renaming or self-referential fit.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the neural collapse assumption drawn from prior work and introduces a new classifier construction whose independent verification is limited to the reported experiments.

axioms (1)

domain assumption: neural collapse in the original model
Invoked to show that classifier-only adjustment suffices for output-level unlearning while preserving retain accuracy.
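Whether a trained model actually satisfies this assumption can be checked numerically. The sketch below computes three rough diagnostics in the spirit of the standard neural collapse properties: within-class variability (NC1), closeness of class means to a simplex arrangement (NC2), and feature-classifier alignment (NC3). The exact formulas, the function name `neural_collapse_metrics`, and the synthetic test data are assumptions made here and may differ from the definitions used in the paper.

```python
import numpy as np

def neural_collapse_metrics(features, labels, weights):
    """Rough NC1/NC2/NC3 diagnostics; weights rows follow sorted class order."""
    classes = np.unique(labels)
    C = classes.size
    means = np.stack([features[labels == c].mean(axis=0) for c in classes])
    centered = means - features.mean(axis=0)

    # NC1: within-class variance relative to between-class variance.
    within = np.mean([np.sum((features[labels == c] - means[i]) ** 2, axis=1).mean()
                      for i, c in enumerate(classes)])
    between = np.mean(np.sum(centered ** 2, axis=1))
    nc1 = float(within / between)

    # NC2: worst deviation of pairwise cosines from the simplex value -1/(C-1).
    unit = centered / np.linalg.norm(centered, axis=1, keepdims=True)
    cos = unit @ unit.T
    nc2 = float(np.max(np.abs(cos[~np.eye(C, dtype=bool)] + 1.0 / (C - 1))))

    # NC3: distance between normalized classifier weights and centered means.
    nc3 = float(np.linalg.norm(weights / np.linalg.norm(weights)
                               - centered / np.linalg.norm(centered)))
    return nc1, nc2, nc3

# Synthetic check: features near an exact simplex arrangement, aligned head.
rng = np.random.default_rng(0)
C = 4
etf = np.sqrt(C / (C - 1)) * (np.eye(C) - np.ones((C, C)) / C)
labels = np.repeat(np.arange(C), 100)
features = 3.0 * etf[labels] + 0.01 * rng.standard_normal((labels.size, C))
nc1, nc2, nc3 = neural_collapse_metrics(features, labels, weights=2.0 * etf)

# Same features, but the head's row for class 0 is flipped (misaligned head).
flipped = 2.0 * etf
flipped[0] *= -1.0
_, _, nc3_misaligned = neural_collapse_metrics(features, labels, weights=flipped)
print(nc1, nc2, nc3, nc3_misaligned)
```

All three values are near zero for the collapsed, aligned case, and only the alignment term grows when a single classifier row is changed, which mirrors the kind of feature-classifier gap the review describes.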

invented entities (1)

class-mean features (CMF) classifier (no independent evidence)
purpose: To explicitly enforce alignment between internal features and the classifier during unlearning
Proposed as a new method to reduce forgotten information at the representation level.



Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Classification-Head Bias in Class-Level Machine Unlearning: Diagnosis, Mitigation, and Evaluation
cs.LG 2026-05 conditional novelty 7.0

Class-level unlearning shortcuts via bias suppression in the classification head; new bias-aware training mechanisms and bias-specific metrics are introduced to diagnose and reduce this dependence.
