pith · machine review for the scientific record

arxiv: 2605.02206 · v2 · submitted 2026-05-04 · 💻 cs.CV · cs.LG


Metric Unreliability in Multimodal Machine Unlearning: A Systematic Analysis and Principled Unified Score


Pith reviewed 2026-05-11 01:42 UTC · model grok-4.3

classification 💻 cs.CV cs.LG
keywords machine unlearning · metric reliability · vision-language models · multimodal evaluation · unified quality score · VQA benchmarks · Kendall tau analysis

The pith

Five standard metrics for multimodal unlearning give conflicting rankings, but a new unified score weighted by oracle correlations stabilizes them.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that five common metrics for judging machine unlearning in vision-language models often rank the same methods differently across VQA benchmarks. This inconsistency appears in two opposing clusters of metrics and grows larger when models handle both images and text than when they handle one modality alone. The authors create the Unified Quality Score by weighting each metric according to its correlation with an oracle model retrained only on data that should be kept. If correct, this approach replaces unreliable single-metric judgments with one composite number that holds steady even when small changes are made to the weights.

Core claim

Five standard metrics yield conflicting method rankings across three VQA benchmarks, with Kendall tau analysis revealing two opposing clusters {FA, RA, MIA} and {AD, JS}, and lower average agreement in multimodal VQA than in unimodal classification. The Unified Quality Score, formed by weighting each metric by its Spearman correlation with the oracle distance d(M_hat, M_star) where M_star is the model retrained only on the retain set, produces stable rankings under 100 random weight perturbations, with RA showing the strongest positive correlation and FA a negative correlation.
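The cluster finding rests on pairwise Kendall tau between the rankings each metric induces over the unlearned models. A minimal sketch with synthetic scores (the data and the simple no-ties implementation are illustrative, not the paper's):

```python
from itertools import combinations

def kendall_tau(a, b):
    """Kendall rank correlation between two score vectors (no tie handling)."""
    n = len(a)
    concordant = discordant = 0
    for i, j in combinations(range(n), 2):
        s = (a[i] - a[j]) * (b[i] - b[j])
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    return (concordant - discordant) / (n * (n - 1) / 2)

# Hypothetical scores for five unlearned models under two metrics.
# FA and AD sit in opposing clusters, so their induced rankings conflict.
fa = [0.10, 0.35, 0.22, 0.50, 0.05]   # Forget Accuracy (synthetic)
ad = [0.90, 0.40, 0.70, 0.20, 0.95]   # Activation Distance (synthetic)

print(kendall_tau(fa, ad))   # a negative tau signals opposing rankings
```

Running this pairwise over all five metrics yields the tau matrix from which the {FA, RA, MIA} versus {AD, JS} cluster structure is read off.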

What carries the argument

The Unified Quality Score (UQS), a composite formed by weighting the five metrics according to each one's Spearman correlation with the oracle distance to the retain-only retrained model.
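A minimal sketch of that weighting scheme on synthetic data: each metric's weight comes from its Spearman correlation with the oracle distance d(M_hat, M_star). How the paper handles FA's negative correlation is not restated here, so this sketch sign-flips negatively correlated metrics and normalizes absolute correlations to sum to 1 (both assumptions):

```python
import numpy as np

def rankdata(x):
    """Simple ranks in 1..n (ties broken by order, not averaged)."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def spearman(x, y):
    """Spearman rho = Pearson correlation of the rank vectors."""
    return np.corrcoef(rankdata(x), rankdata(y))[0, 1]

def uqs_weights(metric_matrix, oracle_dist):
    """metric_matrix: (n_models, n_metrics); oracle_dist: (n_models,).
    Weight each metric by its Spearman correlation with d(M_hat, M_star);
    negatively correlated metrics are sign-flipped here (an assumption)."""
    rhos = np.array([spearman(metric_matrix[:, k], oracle_dist)
                     for k in range(metric_matrix.shape[1])])
    w = np.abs(rhos)
    return w / w.sum(), np.sign(rhos)

rng = np.random.default_rng(0)
scores = rng.random((36, 5))    # 36 models x 5 metrics (synthetic stand-ins)
oracle = rng.random(36)         # stand-in for d(M_hat, M_star)
weights, signs = uqs_weights(scores, oracle)
uqs = (signs * weights * scores).sum(axis=1)   # one composite score per model
print(weights.round(3), uqs.shape)
```

With the paper's reported correlations, RA (rho = 0.484) would receive the largest weight and FA (rho = −0.418) would enter with flipped sign under this convention.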

If this is right

  • Unlearning methods receive more consistent orderings when evaluated with the UQS than with any one of the five individual metrics.
  • Retain Accuracy aligns best with the oracle distance while Forget Accuracy aligns worst.
  • Metric agreement drops in multimodal VQA settings relative to unimodal classification tasks.
  • The released benchmark, 36 model checkpoints, and leaderboard allow direct reproduction and extension of the ranking comparisons.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

  • If UQS becomes standard, future papers on multimodal unlearning could converge on shared method rankings rather than continuing to cite different metrics for opposite conclusions.
  • The split into accuracy-based and divergence-based clusters indicates that the two groups track different properties of successful unlearning.
  • Repeating the analysis on larger VLMs or non-VQA tasks would test whether the observed unreliability is tied to current model sizes and data types.

Load-bearing premise

The distance between an unlearned model and a model retrained only on the retain set supplies the correct ground truth for deciding which metrics are reliable.

What would settle it

If rankings produced by the UQS change when the oracle is replaced by another proxy such as measured privacy leakage or human-rated forgetting quality, the claim that the score provides stable and reliable evaluation would be refuted.

Figures

Figures reproduced from arXiv: 2605.02206 by Abdullah Ahmad Khan, Ferdous Sohel, Hamid Laga.

Figure 3. Finding 2 (left): multimodal amplifies metric disagreement; Finding 3 (right). view at source ↗
Figure 4. UQS stability across 100 Dirichlet-sampled weight variations; mean τ = 0.647 exceeds the τ = 0.5 robustness threshold. view at source ↗
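The perturbation test behind Figure 4 can be sketched as follows, with synthetic scores and stand-in weights. The Dirichlet concentration (10 × the base weights) is an assumption; the paper states only that 100 weight variations were Dirichlet-sampled:

```python
import numpy as np
from itertools import combinations

def kendall_tau(a, b):
    """Sign-based Kendall tau (no tie correction)."""
    n = len(a)
    num = sum(np.sign((a[i] - a[j]) * (b[i] - b[j]))
              for i, j in combinations(range(n), 2))
    return num / (n * (n - 1) / 2)

rng = np.random.default_rng(0)
scores = rng.random((36, 5))                        # 36 models x 5 metrics (synthetic)
base_w = np.array([0.10, 0.35, 0.20, 0.20, 0.15])   # stand-in UQS weights
base_rank = scores @ base_w                         # baseline composite ranking

# Re-rank under 100 Dirichlet-perturbed weight vectors centered on base_w.
taus = [kendall_tau(base_rank, scores @ rng.dirichlet(10.0 * base_w))
        for _ in range(100)]
print(f"mean tau = {np.mean(taus):.3f} +- {np.std(taus):.3f}")
```

A mean tau well above the 0.5 threshold under such resampling is what the paper reports (0.647 ± 0.262) as evidence that UQS rankings do not hinge on the exact fitted weights.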
read the original abstract

Machine unlearning in Vision-Language Models (VLMs) is required for compliance with the General Data Protection Regulation (GDPR), yet current evaluation practices are inconsistent. We present the first systematic study of metric reliability in multimodal unlearning. Five standard metrics, Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), and JS divergence (JS), yield conflicting method rankings across three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench). Kendall tau analysis over 36 unlearned LLaVA-1.5-7B models reveals two opposing clusters, {FA, RA, MIA} and {AD, JS}, with tau_FA_AD = -0.26, reproduced on BLIP-2 OPT-2.7B. Agreement is lower in multimodal VQA (average tau = 0.086) than in unimodal classification (average tau = 0.158; difference = 0.072), indicating that dual image-and-text pathways amplify inconsistency. We introduce the Unified Quality Score (UQS), a composite metric with weights derived from each metric's Spearman correlation with the oracle distance d(M_hat, M_star), where M_star is the oracle model retrained only on the retain set. RA shows the strongest reliability (rho = 0.484, p = 0.003), while FA is negatively correlated (rho = -0.418, p = 0.011). UQS yields stable rankings under 100 random weight perturbations (tau = 0.647 +- 0.262). We release the benchmark, 36 checkpoints, and an interactive leaderboard. Code and pre-computed results are available at https://github.com/neurips26/UnifiedUnl.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript conducts the first systematic analysis of evaluation metric reliability in multimodal machine unlearning for vision-language models. Using 36 unlearned LLaVA-1.5-7B models on three VQA benchmarks (MLLMU-Bench, UnLOK-VQA, MMUBench), it shows that five standard metrics (Forget Accuracy (FA), Retain Accuracy (RA), Membership Inference Attack (MIA), Activation Distance (AD), JS divergence (JS)) yield conflicting method rankings, with Kendall tau analysis revealing two opposing clusters {FA, RA, MIA} and {AD, JS} (e.g., tau_FA_AD = -0.26) and lower average agreement in multimodal settings (0.086) than unimodal (0.158). The paper introduces the Unified Quality Score (UQS), a composite with weights from each metric's Spearman correlation with the oracle distance d(M_hat, M_star) where M_star is retrained only on the retain set, reports RA as most reliable (rho=0.484), and shows UQS rankings remain stable under 100 random weight perturbations (tau=0.647 +- 0.262). The work releases the benchmark, 36 checkpoints, and code.

Significance. If the central findings hold, the paper identifies a key challenge in unlearning evaluation for VLMs, where image-text pathways increase metric inconsistency compared to unimodal cases, and offers UQS as a data-driven way to unify metrics for more stable assessments. The public release of 36 model checkpoints, three benchmarks, an interactive leaderboard, and reproducible code is a clear strength that supports follow-on work in GDPR-compliant multimodal unlearning.

major comments (3)
  1. [UQS construction] UQS construction (abstract and methods): Weights are derived directly from Spearman correlations of each metric against the oracle distance d(M_hat, M_star), with RA reported as most reliable (rho = 0.484, p = 0.003). No sensitivity analysis to the specific definition of M_star (retrained from scratch on retain set only), alternative oracles (e.g., gradient ascent on forget set), or retraining hyperparameters is provided. This choice is load-bearing for both the reliability rankings and the claim that UQS is stable, as the entire validation rests on this internal ground truth.
  2. [Statistical reporting] Statistical reporting for correlations (abstract): The Spearman rho values and p-values used to set UQS weights are computed within the same 36-model experimental setup, yet no error bars, bootstrap intervals, or cross-validation details are reported. This weakens the robustness of the composite score and the claim that UQS yields stable rankings under weight perturbations.
  3. [Cross-modal comparison] Cross-modal comparison (abstract): The claim that agreement is lower in multimodal VQA (average tau = 0.086) than unimodal classification (average tau = 0.158, difference = 0.072) is central to arguing that dual pathways amplify inconsistency, but the unimodal setup (datasets, models, unlearning methods) is not detailed, making the quantitative difference hard to interpret or reproduce.
minor comments (2)
  1. [Abstract] The reproduction statement for BLIP-2 OPT-2.7B mentions the cluster structure but does not report whether the exact tau values or UQS stability hold identically; adding a short table or sentence would strengthen the generalizability claim.
  2. [Kendall tau analysis] Pairwise Kendall tau values (e.g., tau_FA_AD = -0.26) are given in text; a compact table of all pairs would improve readability of the opposing-cluster finding.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their constructive and detailed feedback. We address each major comment point by point below, indicating where revisions will be made to improve the manuscript's robustness and clarity.

read point-by-point responses
  1. Referee: [UQS construction] UQS construction (abstract and methods): Weights are derived directly from Spearman correlations of each metric against the oracle distance d(M_hat, M_star), with RA reported as most reliable (rho = 0.484, p = 0.003). No sensitivity analysis to the specific definition of M_star (retrained from scratch on retain set only), alternative oracles (e.g., gradient ascent on forget set), or retraining hyperparameters is provided. This choice is load-bearing for both the reliability rankings and the claim that UQS is stable, as the entire validation rests on this internal ground truth.

    Authors: We agree that sensitivity analysis to the oracle definition would strengthen the claims, as this choice underpins the reliability rankings. The retrained-from-scratch M_star on the retain set follows the standard oracle definition in the unlearning literature. In the revised manuscript we will add sensitivity experiments using alternative oracles (including gradient ascent on the forget set) and variations in retraining hyperparameters such as epochs and learning rate. We will report the resulting changes to UQS weights, metric reliability orderings, and stability under weight perturbations. revision: yes

  2. Referee: [Statistical reporting] Statistical reporting for correlations (abstract): The Spearman rho values and p-values used to set UQS weights are computed within the same 36-model experimental setup, yet no error bars, bootstrap intervals, or cross-validation details are reported. This weakens the robustness of the composite score and the claim that UQS yields stable rankings under weight perturbations.

    Authors: We acknowledge that uncertainty quantification is currently missing. In the revision we will add bootstrap confidence intervals (1000 resamples) for all reported Spearman rho values and p-values. We will also include error bars or standard deviations on the Kendall tau stability results (tau = 0.647 +- 0.262) obtained under the 100 random weight perturbations to better reflect variability. revision: yes
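The promised bootstrap could look like the following sketch (synthetic data; the `rankdata` here breaks ties by order rather than averaging, a simplification relative to standard Spearman):

```python
import numpy as np

def rankdata(x):
    """Simple ranks in 1..n (ties broken by order, not averaged)."""
    order = np.argsort(x)
    ranks = np.empty(len(x))
    ranks[order] = np.arange(1, len(x) + 1)
    return ranks

def spearman(x, y):
    return np.corrcoef(rankdata(x), rankdata(y))[0, 1]

rng = np.random.default_rng(0)
n = 36                                        # one sample per unlearned model
oracle = rng.random(n)                        # stand-in for d(M_hat, M_star)
ra = oracle + 0.8 * rng.standard_normal(n)    # synthetic positively related metric

# Nonparametric bootstrap: resample the 36 models with replacement.
boot = []
for _ in range(1000):
    idx = rng.integers(0, n, n)
    boot.append(spearman(ra[idx], oracle[idx]))
lo, hi = np.percentile(boot, [2.5, 97.5])
print(f"rho = {spearman(ra, oracle):.3f}, 95% CI [{lo:.3f}, {hi:.3f}]")
```

An interval of this kind around each reported rho (e.g. RA's 0.484, FA's −0.418) would show directly whether the sign and ordering of the UQS weights are stable under resampling of the 36 models.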

  3. Referee: [Cross-modal comparison] Cross-modal comparison (abstract): The claim that agreement is lower in multimodal VQA (average tau = 0.086) than unimodal classification (average tau = 0.158, difference = 0.072) is central to arguing that dual pathways amplify inconsistency, but the unimodal setup (datasets, models, unlearning methods) is not detailed, making the quantitative difference hard to interpret or reproduce.

    Authors: We agree that insufficient detail on the unimodal setup limits interpretability and reproducibility. In the revised manuscript we will add a dedicated methods subsection (or appendix) that fully specifies the unimodal datasets, models, unlearning methods, number of models, and exact evaluation protocol used to obtain the average tau of 0.158. This will allow readers to assess the reported difference of 0.072 and the claim that dual pathways increase inconsistency. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper computes Kendall tau values directly from the five raw metrics evaluated on the 36 unlearned models to demonstrate ranking conflicts and cluster structure; this step uses no fitted parameters or self-referential definitions. The UQS is introduced as a new composite whose weights are set by Spearman correlations against an externally constructed oracle d(M_hat, M_star) (M_star retrained from scratch on the retain set), which is an independent benchmark rather than a quantity derived from the metrics themselves. The subsequent stability test perturbs those weights randomly and re-computes rankings, constituting a robustness check rather than a reduction of any claim to the input data by construction. No self-citations, ansatzes, or uniqueness theorems are invoked as load-bearing steps, and the primary empirical findings remain self-contained against the reported model checkpoints and benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim depends on treating the retain-only retrained model as the ideal reference point and on using observed Spearman correlations to set metric weights; no new physical entities are postulated.

axioms (1)
  • domain assumption The model retrained solely on the retain set represents the ideal unlearned state against which metric quality should be judged.
    Invoked to define the oracle distance used for weighting the five metrics.
invented entities (1)
  • Unified Quality Score (UQS) no independent evidence
    purpose: Composite metric that produces stable unlearning rankings by weighting standard metrics according to their oracle correlation.
    Defined and validated within the 36-model experimental setup of this study.

pith-pipeline@v0.9.0 · 5641 in / 1367 out tokens · 41130 ms · 2026-05-11T01:42:24.362980+00:00 · methodology


Reference graph

Works this paper leans on

65 extracted references · 65 canonical work pages

  1. [1]

    General Data Protection Regulation. Regulation on the protection of natural persons with regard to the processing of personal data and on the free movement of such data, and repealing directive 95/46/ec (data protection directive). L119, page 188, 2016

  2. [2]

    Towards making systems forget with machine unlearning

    Yinzhi Cao and Junfeng Yang. Towards making systems forget with machine unlearning. In 2015 IEEE Symposium on Security and Privacy (SP), pages 463–480. IEEE, 2015. = 0.647 ± 0.262 Perfect agreement 10

  3. [3]

    Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot

    Lucas Bourtoule, Varun Chandrasekaran, Christopher A. Choquette-Choo, Hengrui Jia, Adelin Travers, Baiwu Zhang, David Lie, and Nicolas Papernot. Machine unlearning. In 2021 IEEE Symposium on Security and Privacy (SP), pages 141–159. IEEE, 2021

  4. [4]

    Guan, Gregory Valiant, and James Y

    Antonio Ginart, Melody Y. Guan, Gregory Valiant, and James Y. Zou. Making AI forget you: Data deletion in machine learning. In Advances in Neural Information Processing Systems , volume 32, 2019

  5. [5]

    Remember what you want to forget: Algorithms for machine unlearning

    Ayush Sekhari, Jayadev Acharya, Gautam Kamath, and Ananda Theertha Suresh. Remember what you want to forget: Algorithms for machine unlearning. In Advances in Neural Information Processing Systems, volume 34, pages 18075–18086, 2021

  6. [6]

    The right to be forgotten in federated learning: An efficient realization with rapid retraining

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary C Lipton, and J Zico Kolter. Tofu: A task of fictitious unlearning for llms. arXiv preprint arXiv:2401.06121, 2024

  7. [7]

    arXiv preprint arXiv:2407.06460 , year=

    Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A Smith, and Chiyuan Chang. MUSE: Machine unlearning six-way evaluation for language models. arXiv preprint arXiv:2407.06460, 2024

  8. [8]

    Protecting privacy in multimodal large language models with MLLMU -bench

    Zheyuan Liu, Guangyao Dou, Mengzhao Jia, Zhaoxuan Tan, Qingkai Zeng, Yongle Yuan, and Meng Jiang. Protecting privacy in multimodal large language models with MLLMU -bench. In Luis Chiruzzo, Alan Ritter, and Lu Wang, editors, Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human La...

  9. [9]

    Single image unlearning: Efficient machine unlearning in multimodal large language models

    Jiaqi Li, Qianshan Wei, Chuanyi Zhang, Guilin Qi, Miaozeng Du, Yongrui Chen, Sheng Bi, and Fan Liu. Single image unlearning: Efficient machine unlearning in multimodal large language models. Advances in Neural Information Processing Systems, 37:35414–35453, 2024

  10. [10]

    Long Ouyang, Jeff Wu, Xu Jiang, Diogo Almeida, Carroll L. Wainwright, Pamela Mishra, Chong Zhang, Sandhini Agarwal, Katarina Slama, Alex Ray, John Schulman, Jacob Hilton, Fraser Kelton, Luke Miller, Maddie Simens, Amanda Askell, Peter Welinder, Paul F. Christiano, Jan Leike, and Ryan Lowe. Training language models to follow instructions with human feedbac...

  11. [11]

    Visual instruction tuning

    Haotian Liu, Chunyuan Li, Qingyang Wu, and Yong Jae Lee. Visual instruction tuning. In Advances in Neural Information Processing Systems, volume 36, 2024

  12. [12]

    Improved baselines with visual instruction tuning

    Haotian Liu, Chunyuan Li, Yuheng Li, and Yong Jae Lee. Improved baselines with visual instruction tuning. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 26296–26306, 2024

  13. [13]

    A survey of machine unlearning

    Thanh Tam Nguyen, Thanh Trung Huynh, Zhao Ren, Phi Le Nguyen, Alan Wee-Chung Liew, Hongzhi Yin, and Quoc Viet Hung Nguyen. A survey of machine unlearning. ACM Transactions on Intelligent Systems and Technology, 16(5):1–46, 2025

  14. [14]

    Unlearning sensitive information in multimodal llms: Benchmark and attack -defense evaluation, 2025

    Vaidehi Patil, Yi-Lin Sung, Peter Hase, Jie Peng, Tianlong Chen, and Mohit Bansal. Unlearning sensitive information in multimodal llms: Benchmark and attack -defense evaluation, 2025. URL https://arxiv.org/abs/2505.01456

  15. [15]

    C., Kolter, J

    Vineeth Dorna, Anmol Mekala, Wenlong Zhao, Andrew McCallum, Zachary C Lipton, J Zico Kolter, and Pratyush Maini. OpenUnlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics. arXiv preprint arXiv:2506.12618, 2025

  16. [16]

    Large language model unlearning

    Yuanshun Yao, Xiaojun Xu, and Yang Liu. Large language model unlearning. In Advances in Neural Information Processing Systems, volume 37, 2024. 11

  17. [17]

    Knowledge unlearning for mitigating privacy risks in language models

    Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics, pages 14389–14408, 2023

  18. [18]

    Who’s harry potter? approximate unlearning for llms

    Ronen Eldan and Mark Russinovich. Who’s harry potter? approximate unlearning for llms. xxx, 2023

  19. [19]

    Rethinking machine unlearning for large language models

    Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, et al. Rethinking machine unlearning for large language models. Nature Machine Intelligence, 7(2):181–194, 2025

  20. [20]

    A lifelong learning perspective for mobile robot control

    Sebastian Thrun. A lifelong learning perspective for mobile robot control. In Intelligent robots and systems, pages 201–214. Elsevier, 1995

  21. [21]

    Eternal sunshine of the spotless net: Selective forgetting in deep networks

    Aditya Golatkar, Alessandro Achille, and Stefano Soatto. Eternal sunshine of the spotless net: Selective forgetting in deep networks. In Proceedings of the IEEE/CVF conference on computer vision and pattern recognition, pages 9304–9312, 2020

  22. [22]

    DeltaGrad: Rapid retraining of machine learning models

    Yinjun Wu, Edgar Dobriban, and Susan Davidson. DeltaGrad: Rapid retraining of machine learning models. In International Conference on Machine Learning, pages 10355–10366, 2020

  23. [23]

    SalUn: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation

    Chongyu Fan, Jiancheng Liu, Yihua Zhang, Eric Wong, Dennis Wei, and Sijia Liu. SalUn: Empowering machine unlearning via gradient-based weight saliency in both image classification and generation. In International Conference on Learning Representations, 2024

  24. [24]

    Towards un- bounded machine unlearning

    Meghdad Kurmanji, Peter Triantafillou, Jamie Hayes, and Eleni Triantafillou. Towards un- bounded machine unlearning. Advances in neural information processing systems , 36:1957– 1987, 2023

  25. [25]

    Chundawat, Ayush K

    Vikram S. Chundawat, Ayush K. Mandal, Murari Ahmad, Xuanlong Wu, and Mohan Kankan- halli. Can bad teaching induce forgetting? unlearning in deep networks using an incompetent teacher. In Proceedings of the AAAI Conference on Artificial Intelligence , volume 37, pages 7210–7218, 2023

  26. [26]

    Boundary unlearning: Rapid forgetting of deep networks via shifting the decision boundary

    Min Chen, Zhiwei Gao, Gaoyang Liu, Kai Peng, and Chen Wang. Boundary unlearning: Rapid forgetting of deep networks via shifting the decision boundary. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 7766–7775, 2023

  27. [27]

    Fast machine unlearning without retraining through selective synaptic dampening

    Jack Foster, Stefan Schoepf, and Alexandra Brintrup. Fast machine unlearning without retraining through selective synaptic dampening. In Proceedings of the AAAI Conference on Artificial Intelligence, volume 38, pages 12043–12051, 2024

  28. [28]

    Mixed-privacy forgetting in deep networks

    Aditya Golatkar, Alessandro Achille, Avinash Ravichandran, Marzia Polito, and Stefano Soatto. Mixed-privacy forgetting in deep networks. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pages 792–801, 2021

  29. [29]

    Lora: Low-rank adaptation of large language models

    Edward J Hu, Yelong Shen, Phillip Wallis, Zeyuan Allen-Zhu, Yuanzhi Li, Shean Wang, Liang Wang, Weizhu Chen, et al. Lora: Low-rank adaptation of large language models. Iclr, 1(2):3, 2022

  30. [30]

    Learning multiple layers of features from tiny images

    Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report, University of Toronto, 2009

  31. [31]

    Lawrence Zitnick, and Devi Parikh

    Stanislaw Antol, Aishwarya Agrawal, Jiasen Lu, Margaret Mitchell, Dhruv Batra, C. Lawrence Zitnick, and Devi Parikh. VQA: Visual question answering. In Proceedings of the IEEE International Conference on Computer Vision, pages 2425–2433, 2015

  32. [32]

    Detecting pretraining data from large language models

    Weijia Shi, Anirudh Ajith, Mengzhou Xia, and et al. Detecting pretraining data from large language models. In International Conference on Learning Representations (ICLR), 2023

  33. [33]

    Membership inference attacks against machine learning models

    Reza Shokri, Marco Stronati, Congzheng Song, and Vitaly Shmatikov. Membership inference attacks against machine learning models. In 2017 IEEE Symposium on Security and Privacy (SP), pages 3–18. IEEE, 2017. 12

  34. [34]

    Membership inference attacks from first principles

    Nicholas Carlini, Steve Chien, Milad Nasr, Shuang Song, Andreas Terzis, and Florian Tramer. Membership inference attacks from first principles. In 2022 IEEE Symposium on Security and Privacy (SP), pages 1897–1914. IEEE, 2022

  35. [35]

    Nicolas Papernot, Patrick McDaniel, Arunesh Sinha, and Michael P. Wellman. SoK: Security and privacy in machine learning. In 2018 IEEE European Symposium on Security and Privacy (EuroS&P), pages 399–414. IEEE, 2018

  36. [36]

    ACDC: The adverse conditions dataset with correspondences for robust semantic driving scene perception,

    Micah Goldblum, Dimitris Tsipras, Chulin Xie, Xinyun Chen, Avi Schwarzschild, Dawn Song, Aleksander Madry, Bo Li, and Tom Goldstein. Dataset Security for Machine Learning: Data Poisoning, Backdoor Attacks, and Defenses . IEEE Transactions on Pattern Analysis & Machine Intelligence , 45(02):1563–1580, February 2023. ISSN 1939 -3539. doi: 10.1109/ TPAMI.202...

  37. [37]

    Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell

    James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, An- drei A. Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, Demis Hassabis, Claudia Clopath, Dharshan Kumaran, and Raia Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences , 114(13): ...

  38. [38]

    Locating and editing factual associations in GPT

    Kevin Meng, David Bau, Alex Andonian, and Yonatan Belinkov. Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems, volume 35, pages 17359–17372, 2022

  39. [39]

    Eric Mitchell, Charles Lin, Antoine Bosselut, Chelsea Finn, and Christopher D. Manning. Memory-based model editing at scale. In International Conference on Machine Learning, pages 15817–15831, 2022

  40. [40]

    BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models

    Junnan Li, Dongxu Li, Silvio Savarese, and Steven Hoi. BLIP-2: Bootstrapping language-image pre-training with frozen image encoders and large language models. In International Conference on Machine Learning, pages 19730–19742, 2023

  41. [41]

    Learning transferable visual models from natural language supervision

    Alec Radford, Jong Wook Kim, Chris Hallacy, Aditya Ramesh, Gabriel Goh, Sandhini Agar- wal, Girish Sastry, Amanda Askell, Pamela Mishkin, Jack Clark, Gretchen Krueger, and Ilya Sutskever. Learning transferable visual models from natural language supervision. In Interna- tional Conference on Machine Learning, pages 8748–8763, 2021

  42. [42]

    Rank aggregation methods for the web

    Cynthia Dwork, Ravi Kumar, Moni Naor, and Dandapani Sivakumar. Rank aggregation methods for the web. In Proceedings of the 10th international conference on World Wide Web, pages 613–622, 2001

  43. [43]

    A mathematical theory of evidence

    Glenn Shafer. A mathematical theory of evidence. xx, 2020

  44. [44]

    Maurice G. Kendall. A new measure of rank correlation. Biometrika, 30(1/2):81–93, 1938

  45. [45]

    BLEU: a method for automatic evaluation of machine translation

    Kishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. BLEU: a method for automatic evaluation of machine translation. In Proceedings of the 40th Annual Meeting of the Association for Computational Linguistics, pages 311–318, 2002

  46. [46]

    Weinberger, and Yoav Artzi

    Tianyi Zhang, Varsha Kishore, Felix Wu, Kilian Q. Weinberger, and Yoav Artzi. BERTScore: Evaluating text generation with BERT. In International Conference on Learning Representa- tions, 2020

  47. [47]

    Calibrating noise to sensitivity in private data analysis

    Cynthia Dwork, Frank McSherry, Kobbi Nissim, and Adam Smith. Calibrating noise to sensitivity in private data analysis. In Theory of Cryptography Conference, pages 265 –284. Springer, 2006. 13 A Implementation Details Hardware: RTX 4090 (24GB), Windows 11, Python 3.13, PyTorch 2.x, HuggingFace Transformers, PEFT. Training: AdamW (η = 2 × 10−5), batch 2, F...

  48. [48]

    1-feature single-split LR (original protocol): max-confidence only, single 80/20 split

  49. [49]

    3-feature CV LR (our protocol): confidence + entropy + margin, 5-fold CV

  50. [50]

    Rank correlation (Kendall’s τ ) between rankings from all three protocols across 4 unlearning methods: τ1-feat, 3-feat = 0.83, τ3-feat, LiRA = 0.83, τ1-feat, LiRA = 0.67

    Shadow-model LiRA: 4 shadow models, likelihood ratio attack. Rank correlation (Kendall’s τ ) between rankings from all three protocols across 4 unlearning methods: τ1-feat, 3-feat = 0.83, τ3-feat, LiRA = 0.83, τ1-feat, LiRA = 0.67. The 3 -feature CV protocol agrees substantially better with LiRA than the 1-feature protocol, validating its use as the prima...

  51. [51]

    These claims are supported by controlled experiments across multiple datasets (MLL MU-Bench, UnLOK- VQA, MMUBench, CIFAR-10) and multiple unlearning methods

    Claims Question: Do the main claims made in the abstract and introduction accurately reflect the paper’s contributions and scope? Answer: [Yes] Justification: The paper makes three primary claims: (i) that standard machine unlearning metrics (FA, RA, MIA, AD, JS) can produce contradictory rankings of methods, (ii) that this contradiction arises from a fun...

  52. [52]

    Limitations Question: Does the paper discuss the limitations of the work? Answer: [Yes] Justification: The limitations are discussed in Section 7. These include: (i) evaluation primarily on a single model family (LLaVA-1.5-7B), (ii) dependence of UQS weights on dataset and model distribution, requiring re -calibration for new settings, (iii) limited scala...

  53. [53]

    Instead, it provides empirical analysis and data-driven methodology (UQS)

    Theory Assumptions and Proofs Question: For each theoretical result, does the paper provide the full set of assumptions and a complete proof? Answer: [N/A] Justification: The paper does not present formal theorems or proofs. Instead, it provides empirical analysis and data-driven methodology (UQS). Theoretical claims are limited to 15 the interpretation o...

  54. [54]

    Appendix A provides full implementation details, including hyperparameters, training setup, and evaluation scripts

    Experimental Result Reproducibility Question: Does the paper fully disclose all the information needed to reproduce the main experimental results of the paper to the extent that it affects the main claims? Answer: [Yes] Justification: The paper specifies all necessary components for reproducibility, including: (i) datasets used (MLLMU-Bench, UnLOK-VQA, MM...

  55. Open access to data and code

    Question: Does the paper provide open access to the data and code used in the paper?
    Answer: [Yes]
    Justification: The codebase, trained checkpoints, and evaluation pipeline are released through the project repository. The datasets used are publicly available benchmarks. The repository includes instructions for reproducing the ...

  56. Experimental Setting/Details

    Question: Does the paper specify all the training and test details (e.g., data splits, hyperparameters, how they were chosen) necessary to understand the results?
    Answer: [Yes]
    Justification: The experimental setup is described in Sections 3-5 and Appendix A. This includes dataset construction, model architecture (vision-l...

  57. Experiment Statistical Significance

    Question: Does the paper report error bars suitably and correctly defined and not misleading?
    Answer: [Yes]
    Justification: The paper reports statistical measures including Spearman correlation coefficients (ρ) with corresponding p-values (e.g., n = 36). Stability of UQS is evaluated via repeated trials (100 runs), reported as mean ± standard deviation...
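The stability protocol described here (100 random weight perturbations, reported as mean ± standard deviation) can be mimicked in a few lines. The metric values, method names, and base weights below are invented for illustration, and the perturbation scale is an assumption, not the paper's setting.

```python
import random
import statistics

random.seed(0)

# Hypothetical normalized metric values per method (FA, RA, MIA, AD, JS);
# illustrative numbers only, not the paper's results.
scores = {
    "GA": [0.2, 0.9, 0.8, 0.4, 0.5],
    "KL": [0.3, 0.7, 0.6, 0.5, 0.4],
    "PO": [0.5, 0.5, 0.4, 0.6, 0.3],
}
base_w = [0.15, 0.35, 0.25, 0.15, 0.10]  # assumed UQS weights, summing to 1

def uqs(metric_values, w):
    return sum(m * wi for m, wi in zip(metric_values, w))

def ranking(w):
    return sorted(scores, key=lambda name: uqs(scores[name], w), reverse=True)

base_rank = ranking(base_w)
stable = 0
ga_scores = []
for _ in range(100):  # 100 random weight perturbations, as in the paper
    w = [max(wi + random.gauss(0, 0.02), 0.0) for wi in base_w]
    total = sum(w)
    w = [wi / total for wi in w]  # renormalize the perturbed weights
    if ranking(w) == base_rank:
        stable += 1
    ga_scores.append(uqs(scores["GA"], w))

print(f"ranking unchanged in {stable}/100 perturbed trials")
print(f"UQS(GA) = {statistics.mean(ga_scores):.3f} ± {statistics.stdev(ga_scores):.3f}")
```

When the UQS gaps between methods are large relative to the perturbation scale, the induced ranking stays fixed across trials, which is the stability property the checklist answer refers to.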

  58. Experiments Compute Resources

    Question: For each experiment, does the paper provide sufficient information on the computational resources (type of compute, computation hours) used?
    Answer: [Yes]
    Justification: The paper specifies hardware used (e.g., RTX 4090 GPU with 24GB memory), training setup, and approximate runtime in Appendix A. This information is sufficient to estimate computational requirements for reproducing the experiments...

  59. Code Of Ethics

    Question: Does the research conducted in the paper conform, in every respect, with the NeurIPS Code of Ethics?
    Answer: [Yes]
    Justification: The work focuses on evaluation methodology for machine unlearning and does not involve human subjects, personal data, or sensitive applications. It aligns with responsible AI research practices. Guid...

  60. Broader Impacts

    Question: Does the paper discuss both potential positive societal impacts and negative societal impacts of the work?
    Answer: [Yes]
    Justification: Section 6 discusses implications for privacy-preserving machine learning and regulatory compliance (e.g., GDPR). The work improves evaluation standards, potentially leading to stronger privacy guarantees...

  61. Safeguards

    Question: Does the paper describe safeguards that have been put in place for responsible release of data or models?
    Answer: [N/A]
    Justification: The work does not introduce new datasets or systems that raise safety or misuse concerns.
    Guidelines: [N/A]

  62. Licenses for existing assets

    Question: Are the licenses of all assets used in the paper compatible with the paper's use?
    Answer: [Yes]
    Justification: All datasets and models used are publicly available and licensed for research use (e.g., LLaVA, CIFAR-10, public multimodal benchmarks).
    Guidelines: [N/A]

  63. New Assets

    Question: Are new assets introduced in the paper documented, and is the documentation provided?
    Answer: [Yes]
    Justification: The benchmark pipeline, evaluation framework, and UQS implementation are documented and released with usage instructions.
    Guidelines: [N/A]

  64. Crowdsourcing and Research with Human Subjects

    Question: For crowdsourcing experiments and research with human subjects, does the paper include the full text of instructions given to participants?
    Answer: [N/A]
    Justification: No human subjects or crowdsourcing involved.
    Guidelines: [N/A]

  65. Institutional Review Board (IRB) Approvals or Equivalent for Research with Human Subjects

    Question: Does the paper describe potential risks incurred by study participants and whether IRB approvals were obtained?
    Answer: [N/A]
    Justification: No human subjects involved.
    Guidelines: [N/A]