MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Aman Chadha; Amitava Das; Saksham Thakur; Shubham Gaur; Suryash Yagnik; Vinija Jain

arxiv: 2605.30514 · v1 · pith:RW7L76ATnew · submitted 2026-05-28 · 💻 cs.LG · cs.CL

MAAT: Multi-phase Adapter-Aware Targeted Unlearning

Suryash Yagnik , Shubham Gaur , Saksham Thakur , Vinija Jain , Aman Chadha , Amitava Das This is my paper

Pith reviewed 2026-06-29 08:14 UTC · model grok-4.3

classification 💻 cs.LG cs.CL

keywords machine unlearningcausal knowledgeWhy-type questionsLoRA adaptersforget-retain trade-offbenchmark evaluationadapter-aware unlearningmulti-phase framework

0 comments

The pith

MAAT is the first unlearning method to achieve both high forgetting and high retention on Why-type causal knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Benchmarks for machine unlearning contain almost no Why-type questions that test causal and relational facts, so methods can score well overall while failing on the hardest cases. The paper introduces a balanced 5,000-sample benchmark split evenly across Who, What, When, Where, and Why categories to make those failures measurable. It shows that existing approaches either forget too little on causal chains or damage retained knowledge when they try to forget aggressively. MAAT applies a three-phase process directly to LoRA adapter weights to reach a new point where both forgetting unwanted causal facts and retaining wanted knowledge succeed at high levels.

Core claim

No existing baseline simultaneously achieves high forgetting and high retention on Why-type questions because of multi-hop reasoning chains and long answer spans; MAAT, a three-phase framework operating on LoRA adapter weights that combines gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair, is the first method to do so and reaches a new operating point on the forget-retain Pareto frontier.

What carries the argument

Three-phase framework on LoRA adapter weights that combines gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair.

If this is right

Why-type causal knowledge can now be evaluated separately from easier factual recall in unlearning tasks.
Multi-hop reasoning chains in answers become targets that can be removed without broad degradation of retained knowledge.
Adapter-weight operations allow targeted changes that avoid the full retraining cost of other unlearning approaches.
A measurable new point exists on the forget-retain frontier that prior single-phase methods could not reach.
Causal facts can be unlearned while preserving performance on non-causal question types.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same phased adapter process could be tested on other adapter types or full fine-tuning to see if the balance holds without LoRA.
If the method generalizes, it would allow safer removal of harmful causal associations in deployed models without retraining from scratch.
Extending the 5W benchmark to include longer reasoning chains or cross-domain Why questions would test whether the reported gains persist.

Load-bearing premise

The three-phase combination on LoRA weights will produce the claimed balance on Why-type questions beyond the specific models and data splits used.

What would settle it

A new experiment on a different model or dataset in which MAAT fails to achieve both high forgetting and high retention on Why-type questions.

Figures

Figures reproduced from arXiv: 2605.30514 by Aman Chadha, Amitava Das, Saksham Thakur, Shubham Gaur, Suryash Yagnik, Vinija Jain.

**Figure 1.** Figure 1: Overview of MAAT ( multi-phase adapter-aware targeted unlearning ) architecture. sively on adapter matrices {Al , Bl}; base model weights remain frozen throughout. 4.1 Phase 1: Gradient-Projected Unlearning Standard gradient ascent applies a forget update gf uniformly. If gf has components aligned with the retain gradient gr, those components erode retained knowledge. MAAT removes this component via condit… view at source ↗

**Figure 2.** Figure 2: Ablation study on 5WBENCH (Llama 3.2-3B; 200 samples: 20 forget + 20 retain per label). FSR ↑, RSR ↑. A: Gradient projection + SVD (MLP only) + KL-only repair. B: A with hybrid repair (all four loss terms). C: B + attention pruning (ρattn = 0.01). D: Full MAAT (B + task vector negation; no attention pruning). Key observations from the full profiles. Across all three metrics, MAAT consistently occupies the … view at source ↗

**Figure 3.** Figure 3: Complete unlearning metric profiles per 5W label for LLaMA 3.2-3B (top row) and Gemma 3-4B (bottom [PITH_FULL_IMAGE:figures/full_fig_p013_3.png] view at source ↗

**Figure 4.** Figure 4: Label-wise harmonic mean (%) per 5W cate [PITH_FULL_IMAGE:figures/full_fig_p014_4.png] view at source ↗

read the original abstract

Machine unlearning evaluation is structurally skewed: Why-type questions, which probe causal and relational knowledge, comprise less than 0.06% of CounterFact, 0.6% of ZSRE, and less than 1.3% of TOFU, MUSE, and WMDP-Cyber. This near-zero representation means that methods that fail on causal knowledge can score highly in aggregate, and this failure is undetectable without balanced evaluation. We present 5WBENCH, a balanced 5,000-sample benchmark with 1,000 examples per 5W category (Who, What, When, Where, Why), making causal unlearning failures quantifiable for the first time. Using 5WBENCH, we show that no existing baseline simultaneously achieves high forgetting and high retention on Why-type questions: aggressive forgetting degrades retained knowledge, while conservative methods fail to forget causal facts. Why-type difficulty stems from multi-hop reasoning chains (44% of Why entries vs. less than or equal to 2% for others) and gradient dilution over 40.1-token answer spans. We present MAAT (Multi-phase Adapter-Aware Targeted Unlearning), a three-phase framework operating on LoRA adapter weights, combining gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state retain repair. MAAT is the first method to simultaneously achieve high forgetting and high retention on Why-type causal knowledge, reaching a new operating point on the forget-retain Pareto frontier. We make our code publicly available.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The benchmark 5WBENCH is the solid part here, but MAAT's three-phase method lacks the supporting details to back its Pareto claim.

read the letter

The paper identifies a real issue with how unlearning is evaluated. Why-type questions that test causal relations make up a tiny fraction of standard benchmarks like CounterFact and ZSRE. 5WBENCH balances this with 1000 examples per 5W category, which should make it easier to spot when methods fail on causal knowledge.

MAAT is their proposed fix, a three-phase process on LoRA adapters that mixes gradient-projected ascent, SVD rank pruning, task vector negation, and some KL-based retain repair. They say it is the first to get both good forgetting and good retention on the Why questions, which have more multi-hop chains and longer answer spans.

The benchmark construction is straightforward and addresses a documented gap. Making the code available is also positive.

The method side is thinner. The description stays at the level of listing components without showing how the phases are sequenced or why that sequence beats simpler baselines on the gradient dilution problem. No ablations or specific results are visible in the abstract, so the claim of a new operating point on the forget-retain frontier is not yet supported by evidence in the text. The stress-test note correctly flags that the interactions between phases are not demonstrated.

This work is aimed at people doing machine unlearning research, especially in areas where selective forgetting needs to preserve other knowledge. Someone building or evaluating unlearning methods for LLMs might find the benchmark useful for more complete testing.

It should go to peer review. The evaluation point is worth referee attention, and the method can be assessed once the full experiments and ablations are reviewed.

Referee Report

3 major / 2 minor

Summary. The paper argues that existing machine unlearning benchmarks severely underrepresent Why-type causal questions (less than 0.06% in CounterFact, etc.), introduces the balanced 5WBENCH benchmark with 1,000 examples per 5W category, demonstrates that no prior method achieves both high forgetting and high retention on Why-type entries due to multi-hop chains and gradient dilution, and proposes MAAT, a three-phase LoRA adapter method combining gradient-projected ascent, SVD rank pruning, task vector negation, and hybrid KL retain repair, claiming it is the first to reach a new operating point on the forget-retain Pareto frontier for causal knowledge. Code is released publicly.

Significance. If the experimental claims hold, the work would be significant for exposing and addressing a structural gap in unlearning evaluation on causal reasoning, with the new benchmark enabling quantifiable assessment of failures on Why-type questions. The public code release supports reproducibility. The central result on advancing the Pareto frontier for multi-hop causal unlearning would be of interest if the three-phase interactions are shown to generalize.

major comments (3)

[§3] §3 (MAAT three-phase framework): The claim that the specific ordering of gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state repair overcomes gradient dilution on 40.1-token multi-hop Why answers is load-bearing for the Pareto-frontier advance, yet no derivation, interaction analysis, or equations are supplied showing why this sequence succeeds where single-phase baselines fail.
[§5] §5 (Experimental results on 5WBENCH): The superiority of MAAT on Why-type entries (44% multi-hop) is asserted without ablation tables isolating the contribution of each phase or the effect of phase ordering; this prevents verification that the reported balance is due to the proposed combination rather than benchmark-specific tuning.
[Table in §5] Table reporting Why-type forget/retain metrics: No error bars, multiple random seeds, or statistical tests are mentioned for the claimed new operating point, weakening the robustness of the central performance claim relative to baselines.

minor comments (2)

[Abstract] The abstract states 'less than or equal to 2%' for other categories; consistent use of inequality symbols or exact percentages would improve readability.
[§2] The description of 5WBENCH construction would benefit from a brief example of a Why-type multi-hop question in the main text to illustrate the 40.1-token span issue.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback highlighting areas where the MAAT framework and experimental claims require stronger justification and validation. We address each major comment below and will revise the manuscript to incorporate the suggested improvements.

read point-by-point responses

Referee: [§3] §3 (MAAT three-phase framework): The claim that the specific ordering of gradient-projected ascent, SVD rank-dimension pruning, task vector negation, and hybrid KL-hidden-state repair overcomes gradient dilution on 40.1-token multi-hop Why answers is load-bearing for the Pareto-frontier advance, yet no derivation, interaction analysis, or equations are supplied showing why this sequence succeeds where single-phase baselines fail.

Authors: We agree that the manuscript would benefit from explicit justification of the phase ordering. In the revision we will add a dedicated analysis subsection to §3 that provides the interaction rationale: gradient-projected ascent first identifies and amplifies the causal directions in the adapter weights, SVD rank pruning then removes low-magnitude dimensions that would otherwise dilute the update across the long 40.1-token spans, task-vector negation inverts the retained direction, and the hybrid KL-hidden-state repair restores multi-hop consistency on the retain set. We will include the corresponding update equations and a small set of interaction plots comparing the proposed sequence against single-phase and reordered variants. revision: yes
Referee: [§5] §5 (Experimental results on 5WBENCH): The superiority of MAAT on Why-type entries (44% multi-hop) is asserted without ablation tables isolating the contribution of each phase or the effect of phase ordering; this prevents verification that the reported balance is due to the proposed combination rather than benchmark-specific tuning.

Authors: We accept that the current version lacks the necessary ablations. The revised §5 will contain a new ablation table that reports Why-type forget and retain scores after (i) ablating each phase individually and (ii) testing all six possible phase permutations. These results will quantify the incremental contribution of each component and confirm that the reported operating point is attributable to the specific three-phase combination rather than incidental tuning. revision: yes
Referee: [Table in §5] Table reporting Why-type forget/retain metrics: No error bars, multiple random seeds, or statistical tests are mentioned for the claimed new operating point, weakening the robustness of the central performance claim relative to baselines.

Authors: We will revise the table in §5 to report means and standard deviations computed over five independent random seeds. In addition, we will add pairwise statistical significance tests (Wilcoxon signed-rank) between MAAT and each baseline, with p-values reported in the table caption or a supplementary note. This will provide the requested evidence of robustness for the claimed Pareto-frontier advance. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical method combination with no self-referential derivations

full rationale

The paper presents MAAT as an engineering combination of existing techniques (gradient-projected ascent, SVD rank pruning, task vector negation, hybrid KL repair) applied in three phases to LoRA weights, evaluated on a newly introduced benchmark 5WBENCH. No equations, derivations, or parameter-fitting steps are described in the provided text that reduce by construction to the inputs. Claims of Pareto-frontier improvement are empirical and benchmark-specific rather than derived from self-definitions or self-citations. The derivation chain is self-contained as a proposed practical framework without load-bearing reductions to fitted values or prior author results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; all technical details are absent.

pith-pipeline@v0.9.1-grok · 5826 in / 1031 out tokens · 30826 ms · 2026-06-29T08:14:06.694050+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

10 extracted references · 5 canonical work pages · 1 internal anchor

[1]

Jack Foster, Stefan Schoepf, and Alexandra Brintrup

Alphaedit: Null-space constrained knowledge editing for language models.CoRR, abs/2410.02355. Jack Foster, Stefan Schoepf, and Alexandra Brintrup

work page arXiv
[2]

Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite

Fast machine unlearning without retrain- ing through selective synaptic dampening.CoRR, abs/2308.07707. Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite. 2024. Mech- anistic unlearning: Robust knowledge unlearning and editing via mechanistic localization.CoRR, abs/2410.12949. Edward J. Hu, Yelong Shen, Phillip Wallis,...

work page arXiv 2024
[3]

Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang

OpenReview.net. Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang. 2024. Propagation and pitfalls: Reasoning-based assessment of knowl- edge editing through counterfactual tasks.CoRR, abs/2401.17585. Gabriel Ilharco, Marco Túlio Ribeiro, Mitchell Worts- man, Suchin Gururangan, Ludwig Schmidt, Han- naneh Hajishirzi, and Ali Farh...

work page arXiv 2024
[4]

Gemma Team

MUSE: machine unlearning six-way evalua- tion for language models.CoRR, abs/2407.06460. Gemma Team. 2024a. Gemma: Open models based on gemini research and technology.CoRR, abs/2403.08295. Llama Team. 2024b. The llama 3 herd of models. CoRR, abs/2407.21783. Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Mau- rya, Zhiwei Steven Wu, and Virginia Smith. 2024...

work page arXiv 2024
[5]

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Negative preference optimization: From catastrophic collapse to effective unlearning.CoRR, abs/2404.05868. Zhihua Zhang. 2015. The singular value decomposition, applications and beyond.CoRR, abs/1510.08532. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph...

work page internal anchor Pith review Pith/arXiv arXiv 2015
[6]

Focus ONLY on whether the ground truth is present—ignore any extra or wrong information in the model answer
[7]

Monday evening

Semantic/paraphrase matches count (e.g. “Monday evening” = “Monday night”)
[8]

Partial containment counts if the core fact is present
[9]

For “why” questions: the core causal reason must be present, not just surface word overlap
[10]

contains_ground_truth

Case-insensitive matching. Respond ONLY with one of: {"contains_ground_truth": true} {"contains_ground_truth": false} I Qualitative Examples: Why-Type Unlearning Tables 11 and 12 show representative generation traces from Llama 3.2-3B onWhy-type evalua- tion samples, illustrating how each method handles causal knowledge on the forget and retain splits. Fo...

2019

[1] [1]

Jack Foster, Stefan Schoepf, and Alexandra Brintrup

Alphaedit: Null-space constrained knowledge editing for language models.CoRR, abs/2410.02355. Jack Foster, Stefan Schoepf, and Alexandra Brintrup

work page arXiv

[2] [2]

Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite

Fast machine unlearning without retrain- ing through selective synaptic dampening.CoRR, abs/2308.07707. Phillip Guo, Aaquib Syed, Abhay Sheshadri, Aidan Ewart, and Gintare Karolina Dziugaite. 2024. Mech- anistic unlearning: Robust knowledge unlearning and editing via mechanistic localization.CoRR, abs/2410.12949. Edward J. Hu, Yelong Shen, Phillip Wallis,...

work page arXiv 2024

[3] [3]

Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang

OpenReview.net. Wenyue Hua, Jiang Guo, Mingwen Dong, Henghui Zhu, Patrick Ng, and Zhiguo Wang. 2024. Propagation and pitfalls: Reasoning-based assessment of knowl- edge editing through counterfactual tasks.CoRR, abs/2401.17585. Gabriel Ilharco, Marco Túlio Ribeiro, Mitchell Worts- man, Suchin Gururangan, Ludwig Schmidt, Han- naneh Hajishirzi, and Ali Farh...

work page arXiv 2024

[4] [4]

Gemma Team

MUSE: machine unlearning six-way evalua- tion for language models.CoRR, abs/2407.06460. Gemma Team. 2024a. Gemma: Open models based on gemini research and technology.CoRR, abs/2403.08295. Llama Team. 2024b. The llama 3 herd of models. CoRR, abs/2407.21783. Pratiksha Thaker, Shengyuan Hu, Neil Kale, Yash Mau- rya, Zhiwei Steven Wu, and Virginia Smith. 2024...

work page arXiv 2024

[5] [5]

Negative Preference Optimization: From Catastrophic Collapse to Effective Unlearning

Negative preference optimization: From catastrophic collapse to effective unlearning.CoRR, abs/2404.05868. Zhihua Zhang. 2015. The singular value decomposition, applications and beyond.CoRR, abs/1510.08532. Lianmin Zheng, Wei-Lin Chiang, Ying Sheng, Siyuan Zhuang, Zhanghao Wu, Yonghao Zhuang, Zi Lin, Zhuohan Li, Dacheng Li, Eric P. Xing, Hao Zhang, Joseph...

work page internal anchor Pith review Pith/arXiv arXiv 2015

[6] [6]

Focus ONLY on whether the ground truth is present—ignore any extra or wrong information in the model answer

[7] [7]

Monday evening

Semantic/paraphrase matches count (e.g. “Monday evening” = “Monday night”)

[8] [8]

Partial containment counts if the core fact is present

[9] [9]

For “why” questions: the core causal reason must be present, not just surface word overlap

[10] [10]

contains_ground_truth

Case-insensitive matching. Respond ONLY with one of: {"contains_ground_truth": true} {"contains_ground_truth": false} I Qualitative Examples: Why-Type Unlearning Tables 11 and 12 show representative generation traces from Llama 3.2-3B onWhy-type evalua- tion samples, illustrating how each method handles causal knowledge on the forget and retain splits. Fo...

2019