AMNESIA: A Large Scale Medical Unlearning Benchmark Suite with Disease-Informed Analysis

Nazli Goharian; Ophir Frieder; Reihaneh Iranmanesh; Saeedeh Davoudi

arXiv: 2605.30599 · v1 · pith:LWCD26DSnew · submitted 2026-05-28 · cs.LG · cs.CL

Pith reviewed 2026-06-29 08:13 UTC · model grok-4.3

keywords: medical unlearning · machine unlearning · benchmark · patient notes · disease categories · clinical inference · terminology leakage · LLM

The pith

Unlearning one patient's medical data erodes a model's knowledge of other patients sharing the same disease.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents AMNESIA, a benchmark of 70,560 question-answer pairs drawn from 8,820 patient notes across 11 disease categories, built to study machine unlearning in medical language models. It tests four standard unlearning methods on both randomly chosen patients and entire disease groups, and introduces a metric that tracks whether medical terminology leaks through after unlearning. The central finding is that removing data tied to one patient impairs the model's ability to answer factual and reasoning questions about other patients with the same condition. This pattern indicates that current unlearning techniques cannot cleanly separate personal patient details from the shared clinical knowledge that defines a disease category. The work therefore calls for unlearning approaches designed specifically for domains where individual records and categorical medical facts overlap.

Core claim

AMNESIA shows that unlearning individual patients erodes knowledge of others with the same condition. The benchmark supplies 70,560 factual and reasoning QA pairs from 8,820 patient notes in 11 disease categories. When four common unlearning methods are applied at the patient level, performance on same-disease cases declines; disease-level unlearning produces different leakage patterns. A new terminology-leakage metric quantifies how medical terms remain accessible after unlearning. These results establish that patient-specific facts and shared clinical knowledge are entangled in trained models and that existing methods do not respect this entanglement.
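As a quick sanity check on the benchmark's scale, the counts above imply an average number of QA pairs per note and of notes per category. The sketch below is derived arithmetic only; the averages are not figures reported by the paper, and the real distribution across notes and categories may well be uneven.

```python
# Counts as stated in the summary above.
TOTAL_QA_PAIRS = 70_560
TOTAL_NOTES = 8_820
NUM_CATEGORIES = 11


def average(total: int, groups: int) -> float:
    """Plain arithmetic mean of `total` items spread over `groups` groups."""
    return total / groups


# Derived averages (an illustration, not values reported by the paper).
qa_per_note = average(TOTAL_QA_PAIRS, TOTAL_NOTES)
notes_per_category = average(TOTAL_NOTES, NUM_CATEGORIES)

print(f"{qa_per_note:.1f} QA pairs per note on average")
print(f"{notes_per_category:.1f} notes per category on average")
```

The averages say nothing about how factual and reasoning questions are split within a note, which the summary does not state.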

What carries the argument

The AMNESIA benchmark suite, which organizes large-scale QA pairs by disease category to expose interference between patient-specific facts and shared clinical knowledge during unlearning.

If this is right

Unlearning methods must be tested at both random-patient and disease-group scales to detect cross-patient interference.
Medical unlearning requires explicit mechanisms to isolate individual records from condition-level clinical patterns.
A terminology-leakage metric provides a practical way to measure whether shared medical vocabulary survives unlearning.
Factual recall and clinical-reasoning questions both reveal the same erosion pattern, indicating the problem is not limited to rote memorization.
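To make the idea of a terminology-leakage measurement concrete, here is a minimal sketch that reports the fraction of disease-related terms still appearing in model outputs after unlearning. It is an assumption-laden illustration: the paper's actual metric is not defined in this summary, and the function names, the term list, and the sample outputs are all hypothetical.

```python
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so matching ignores formatting."""
    return re.sub(r"\s+", " ", text.lower()).strip()


def leaked_terms(terms, outputs):
    """Return the terms that occur as whole phrases in at least one output."""
    norm_outputs = [normalize(o) for o in outputs]
    found = set()
    for term in terms:
        pattern = r"\b" + re.escape(normalize(term)) + r"\b"
        if any(re.search(pattern, o) for o in norm_outputs):
            found.add(term)
    return found


def leakage_rate(terms, outputs) -> float:
    """Fraction of distinct terms that still surface in the outputs."""
    unique_terms = list(dict.fromkeys(terms))  # de-duplicate, keep order
    if not unique_terms:
        return 0.0
    return len(leaked_terms(unique_terms, outputs)) / len(unique_terms)


# Hypothetical forget-set terms and post-unlearning model outputs.
terms = ["lung cancer", "chemotherapy", "lobectomy"]
outputs = [
    "The patient was treated with chemotherapy.",
    "I do not have information about that patient.",
]
rate = leakage_rate(terms, outputs)
print(f"leakage rate: {rate:.2f}")
```

Exact phrase matching is the simplest possible choice; it misses paraphrases and abbreviations, so a real metric would likely need a softer notion of overlap.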

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Benchmarks that separate individual and categorical knowledge may be needed in other regulated domains such as legal or financial records.
Training regimes that tag disease-level versus patient-level information at the data stage could reduce the interference observed here.
Regulatory requirements to remove patient data may force periodic re-evaluation of model performance on related conditions.

Load-bearing premise

The constructed QA pairs and disease categories sufficiently capture the distinction between patient-specific facts and shared clinical knowledge that unlearning methods must respect.

What would settle it

A result in which unlearning one patient's records leaves model accuracy on other patients with the same disease completely unchanged would falsify the central claim.
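The falsification test above amounts to comparing accuracy on retained patients before and after unlearning, split by whether they share the forgotten patient's disease. A minimal sketch of that comparison follows; the per-question correctness flags are made-up illustration data, not results from the paper, and the helper names are hypothetical.

```python
from statistics import mean


def accuracy(correct_flags):
    """Share of questions answered correctly."""
    return mean(1.0 if c else 0.0 for c in correct_flags)


def erosion(before, after):
    """Accuracy drop (before minus after); positive means knowledge was lost."""
    return accuracy(before) - accuracy(after)


# Hypothetical correctness flags on retained patients (not paper data).
same_disease_before = [True, True, True, True, False]
same_disease_after = [True, False, True, False, False]
other_disease_before = [True, True, False, True, True]
other_disease_after = [True, True, False, True, True]

same_drop = erosion(same_disease_before, same_disease_after)
other_drop = erosion(other_disease_before, other_disease_after)

print(f"same-disease drop:  {same_drop:.2f}")
print(f"other-disease drop: {other_drop:.2f}")
```

Under the paper's claim one would expect the same-disease drop to be clearly positive; a drop indistinguishable from zero across methods and diseases would count against it.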

Figures

Figures reproduced from arXiv: 2605.30599 by Nazli Goharian, Ophir Frieder, Reihaneh Iranmanesh, Saeedeh Davoudi.

**Figure 2.** Unlearning performance at random patient ...

**Figure 3.** Unlearning performance at disease-level. Horizontal axis is MU on ...

**Figure 4.** Retain MU on seen questions for same-disease ...

Original abstract

Medical knowledge is continuously evolving. This creates a need to update or selectively forget information encoded in already-trained medical LLMs. Machine unlearning aims to remove the influence of specific training data from a model without full retraining. Yet, existing unlearning benchmarks rely on synthetic or small-scale general data, leaving clinical unlearning understudied. We introduce AMNESIA, the first large-scale, open source benchmark for medical unlearning, with 70,560 question-answer pairs from 8,820 patient notes across 11 disease categories. AMNESIA includes both factual questions testing direct recall and reasoning questions testing clinical inference. We use it to evaluate four widely used unlearning methods at both random patient and disease-level, and introduce a new metric for detecting leakage of medical terminology. We show that unlearning individual patients erodes knowledge of others with the same condition, calling for methods that can better separate patients from shared clinical knowledge.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

AMNESIA supplies the first sizable open medical unlearning benchmark and documents that patient-level removal hurts same-disease cases, but the patient-versus-shared distinction in the QA pairs needs explicit validation.

The main takeaway is that this paper ships a 70k-pair benchmark drawn from real patient notes across 11 disease categories, plus a terminology-leakage metric, and reports that unlearning one patient degrades results on others sharing the condition.

The scale and domain focus are the clear advances. Earlier unlearning work stayed on synthetic or general text; here the authors built both factual recall and clinical-reasoning questions, ran four standard methods at patient and disease granularity, and released the data. That setup lets people measure exactly the cross-patient leakage the abstract flags.

The soft spot sits in the core assumption. The erosion result only carries weight if the QA pairs truly isolate patient-specific facts from the shared clinical knowledge that any model would learn across the disease category. The abstract states the distinction but gives no construction details, inter-annotator checks, or controls that would show the questions are not just rephrasing general medical facts. Without those, the finding risks being partly an artifact of how the test set was written.

The new leakage metric is a reasonable addition, yet its calibration and false-positive behavior are not described either. These gaps are fixable with more appendix material rather than fatal.

The paper is aimed at groups building or auditing unlearning pipelines for clinical LLMs. Anyone who needs a concrete testbed for privacy updates in healthcare models will find the dataset and the reported pattern useful even before the analysis is tightened.

It deserves a serious referee. The benchmark itself is large enough and open enough to matter, and the empirical observation points to a practical limitation that future methods must address.

Referee Report

2 major / 2 minor

Summary. The paper introduces AMNESIA, the first large-scale open-source benchmark for medical unlearning, with 70,560 QA pairs derived from 8,820 patient notes across 11 disease categories. It includes both factual recall and clinical reasoning questions, evaluates four standard unlearning methods at the individual-patient and disease levels, proposes a new metric for medical terminology leakage, and reports that patient-level unlearning degrades performance on other patients sharing the same disease.

Significance. If the erosion result is robust, the benchmark is significant because it supplies a clinically grounded testbed that existing synthetic or small-scale unlearning suites lack. The scale, the split between factual and reasoning questions, the disease-informed grouping, and the open release constitute concrete strengths that can drive development of methods able to separate patient-specific facts from shared clinical knowledge. The terminology-leakage metric is a useful addition for evaluation in the medical domain.

major comments (2)

[Benchmark Construction] Benchmark Construction section: the central claim that unlearning one patient erodes performance on others with the same condition rests on the assumption that the QA pairs and 11 disease categories cleanly separate patient-specific facts from shared clinical knowledge. The manuscript supplies no explicit construction protocol, examples of how factual versus reasoning questions were authored, or validation that the groupings achieve this separation; without these details the erosion pattern cannot be interpreted as evidence for the claimed limitation of current methods.
[Evaluation] Evaluation section: the reported erosion finding is presented without statistical significance tests, confidence intervals, or ablation on the choice of disease groupings; because the claim is quantitative and load-bearing for the call for new methods, the absence of these controls weakens the evidential basis.

minor comments (2)

[Abstract] The abstract and title both use 'disease-informed analysis' but the manuscript should add a short paragraph clarifying how the 11 categories were chosen and whether they were validated by clinicians.
Figure captions should explicitly state which unlearning method and which metric (including the new leakage metric) are plotted in each panel.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the potential significance of AMNESIA. We address each major comment below and describe the revisions that will be incorporated.

Referee: [Benchmark Construction] Benchmark Construction section: the central claim that unlearning one patient erodes performance on others with the same condition rests on the assumption that the QA pairs and 11 disease categories cleanly separate patient-specific facts from shared clinical knowledge. The manuscript supplies no explicit construction protocol, examples of how factual versus reasoning questions were authored, or validation that the groupings achieve this separation; without these details the erosion pattern cannot be interpreted as evidence for the claimed limitation of current methods.

Authors: We agree that greater transparency on construction is needed to support interpretation of the erosion results. The current manuscript describes the high-level process (extraction from 8,820 notes into 70,560 QA pairs across 11 ICD-10-aligned categories, with factual items drawn directly from notes and reasoning items derived from clinical guidelines), but does not provide the full protocol, sample pairs, or expert validation steps. In the revision we will add an explicit construction protocol subsection, representative examples of both factual and reasoning questions per disease, and a description of the medical-expert review used to confirm separation of patient-specific versus shared knowledge. These additions will allow readers to assess whether the observed erosion indeed indicates limitations in separating patient facts from disease-level knowledge.

revision: yes

Referee: [Evaluation] Evaluation section: the reported erosion finding is presented without statistical significance tests, confidence intervals, or ablation on the choice of disease groupings; because the claim is quantitative and load-bearing for the call for new methods, the absence of these controls weakens the evidential basis.

Authors: We concur that quantitative claims require statistical controls. The revision will include paired t-tests with p-values and 95% confidence intervals on the performance drops after patient-level unlearning, plus an ablation that varies the disease groupings (e.g., coarser vs. finer partitions) to test robustness of the erosion pattern. These analyses will be reported in the updated Evaluation section and will strengthen the evidential basis for recommending new methods that better isolate patient-specific information.

revision: yes
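The paired analysis the rebuttal promises can be sketched in a few lines. The example below computes the mean paired drop, the paired t statistic, and a 95% confidence interval from hypothetical per-group accuracies; the numbers are invented for illustration, and the critical value 2.776 (two-sided 95%, 4 degrees of freedom) is taken from a standard t table rather than computed.

```python
from math import sqrt
from statistics import mean, stdev


def paired_summary(before, after, t_crit):
    """Mean paired difference, paired t statistic, and confidence interval.

    `t_crit` is the two-sided critical t value for len(before) - 1 degrees
    of freedom, supplied by the caller (e.g. looked up in a t table).
    """
    if len(before) != len(after) or len(before) < 2:
        raise ValueError("need two equal-length samples with at least 2 pairs")
    diffs = [b - a for b, a in zip(before, after)]
    d_mean = mean(diffs)
    std_err = stdev(diffs) / sqrt(len(diffs))
    t_stat = d_mean / std_err if std_err > 0 else float("inf")
    half_width = t_crit * std_err
    return d_mean, t_stat, (d_mean - half_width, d_mean + half_width)


# Hypothetical same-disease accuracy per evaluation group (not paper data).
acc_before = [0.80, 0.75, 0.90, 0.85, 0.70]
acc_after = [0.70, 0.70, 0.78, 0.80, 0.62]

d_mean, t_stat, (ci_low, ci_high) = paired_summary(acc_before, acc_after, t_crit=2.776)
print(f"mean drop {d_mean:.3f}, t = {t_stat:.2f}, 95% CI [{ci_low:.3f}, {ci_high:.3f}]")
```

A confidence interval that excludes zero supports a real drop. For an exact p-value one would need the t distribution's CDF, for instance via SciPy's `scipy.stats.ttest_rel`, which this dependency-free sketch avoids.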

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical benchmark introduction that constructs a dataset of QA pairs from patient notes and evaluates existing unlearning methods on it; no derivations, equations, fitted parameters, or self-citational load-bearing steps are present that reduce any claim to its own inputs by construction. The reported pattern (erosion of related-patient knowledge) follows directly from the benchmark design once the groupings are accepted, with no internal reduction or renaming of results.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

Abstract-only review provides no information on free parameters, axioms, or invented entities; the contribution is the benchmark construction itself.

pith-pipeline@v0.9.1-grok · 5706 in / 1050 out tokens · 28143 ms · 2026-06-29T08:13:57.614120+00:00 · methodology
