Bias Redistribution in Visual Machine Unlearning: Does Forgetting One Group Harm Another?
Pith reviewed 2026-05-10 17:11 UTC · model grok-4.3
The pith
Machine unlearning in CLIP models does not remove demographic bias; it redistributes it, primarily along gender rather than age lines.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Unlearning does not eliminate bias but redistributes it primarily along gender rather than age boundaries. In particular, removing the dominant Young Female group consistently transfers performance to Old Female across all model scales, revealing a gender-dominant structure in CLIP's embedding space. While the Refusal Vector method reduces redistribution, it fails to achieve complete forgetting and significantly degrades retained performance.
What carries the argument
The redistribution score, together with per-group accuracy shifts and demographic parity gaps measured after unlearning, exposes the gender-dominant structure that organizes CLIP's embedding space.
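The paper's exact formula for the score is not reproduced here (the referee below flags the omission); a minimal Python sketch of one plausible definition, with illustrative group names and hypothetical accuracies:

```python
def redistribution_score(acc_before, acc_after, forgotten):
    """Illustrative redistribution score: each retained group's accuracy
    gain after unlearning, as a share of the accuracy lost on the
    forgotten group. Positive values mean performance 'moved' there."""
    lost = acc_before[forgotten] - acc_after[forgotten]
    return {g: (acc_after[g] - acc_before[g]) / max(lost, 1e-8)
            for g in acc_before if g != forgotten}

def demographic_parity_gap(pos_rate):
    """Max minus min positive-prediction rate across groups."""
    return max(pos_rate.values()) - min(pos_rate.values())

# Hypothetical per-group zero-shot accuracies, before and after unlearning.
before = {"young_female": 0.91, "old_female": 0.62,
          "young_male": 0.78, "old_male": 0.70}
after  = {"young_female": 0.31, "old_female": 0.81,
          "young_male": 0.77, "old_male": 0.69}

print(redistribution_score(before, after, forgotten="young_female"))
# A large share landing on old_female, not on the male groups, is the
# gender-aligned transfer pattern the paper reports.
```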
If this is right
- Unlearning one intersectional group can increase accuracy on a correlated group of the same gender.
- Refusal Vector reduces the amount of redistribution compared with Prompt Erasure and Prompt Reweighting (see the sketch after this list).
- Current unlearning methods risk amplifying bias in retained groups when embedding geometry is ignored.
- Complete forgetting without performance degradation on other groups remains unachieved by the tested methods.
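For concreteness, a minimal sketch of the rank-1 directional ablation that refusal-vector style methods build on (cf. Arditi et al.'s single-direction result for language models); the direction estimate, tensor shapes, and random stand-in features are illustrative assumptions, not the paper's implementation:

```python
import torch

def ablate_direction(emb: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    """Remove a single concept direction from embeddings: e' = e - (e.v)v,
    with v unit-norm, so the component along v is zeroed out."""
    v = v / v.norm()
    return emb - (emb @ v).unsqueeze(-1) * v

# Illustrative direction: difference of mean embeddings between the group
# to forget and everything retained (random stand-ins here).
forget_emb = torch.randn(100, 512)   # e.g., Young Female image embeddings
retain_emb = torch.randn(300, 512)   # all retained groups
v = forget_emb.mean(dim=0) - retain_emb.mean(dim=0)

edited = ablate_direction(torch.cat([forget_emb, retain_emb]), v)
# Because one direction is removed from every embedding, some degradation
# on retained groups (as the paper observes) is hard to avoid.
```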
Where Pith is reading between the lines
- Fairness evaluations of unlearning should include explicit checks for cross-group performance transfers rather than only measuring removal success.
- Embedding-space geometry may need to be directly modeled or regularized in future unlearning algorithms to isolate concepts more cleanly.
- The gender-dominant pattern could appear in other multimodal models whose training data contain similar demographic correlations.
Load-bearing premise
The observed performance transfers reflect a general property of the embedding geometry rather than artifacts of the three specific unlearning methods, the redistribution score definition, or the age-gender correlations present in CelebA.
What would settle it
Repeating the unlearning experiments on a dataset where age and gender attributes are statistically independent and finding no consistent accuracy gain on the Old Female group after unlearning Young Female would falsify the claim that the transfer reflects embedding geometry rather than CelebA's label correlations.
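A minimal sketch of how that check could be scored, assuming per-seed Old Female accuracy gains measured on the independent dataset; the numbers and the one-sided t-test are illustrative:

```python
import numpy as np
from scipy import stats

# Hypothetical Old Female accuracy gains after unlearning Young Female,
# one value per seed, on a dataset with independent age and gender.
gains = np.array([0.002, -0.004, 0.001, 0.003, -0.001])

# One-sided test: is the mean gain reliably greater than zero?
t, p = stats.ttest_1samp(gains, popmean=0.0, alternative="greater")
print(f"t={t:.2f}, p={p:.3f}")
# Rejecting here (consistent gain even without label correlations) would
# support the geometry reading; failing to reject would falsify it.
```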
original abstract
Machine unlearning enables models to selectively forget training data, driven by privacy regulations such as GDPR and CCPA. However, its fairness implications remain underexplored: when a model forgets a demographic group, does it neutralize that concept or redistribute it to correlated groups, potentially amplifying bias? We investigate this bias redistribution phenomenon on CelebA using CLIP models (ViT-B/32, ViT-L/14, ViT-B/16) under a zero-shot classification setting across intersectional groups defined by age and gender. We evaluate three unlearning methods (Prompt Erasure, Prompt Reweighting, and Refusal Vector) using per-group accuracy shifts, demographic parity gaps, and a redistribution score. Our results show that unlearning does not eliminate bias but redistributes it primarily along gender rather than age boundaries. In particular, removing the dominant Young Female group consistently transfers performance to Old Female across all model scales, revealing a gender-dominant structure in CLIP's embedding space. While the Refusal Vector method reduces redistribution, it fails to achieve complete forgetting and significantly degrades retained performance. These findings highlight a fundamental limitation of current unlearning methods: without accounting for embedding geometry, they risk amplifying bias in retained groups.
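For orientation, a minimal sketch of the zero-shot setting the abstract describes, using the openai `clip` package; the prompt template, file name, and group labels are illustrative guesses, not the paper's exact setup:

```python
import torch
import clip
from PIL import Image

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)  # also ViT-B/16, ViT-L/14

# Illustrative intersectional prompts for the four age-gender groups.
groups = ["young woman", "old woman", "young man", "old man"]
text = clip.tokenize([f"a photo of a {g}" for g in groups]).to(device)
image = preprocess(Image.open("celeba_example.jpg")).unsqueeze(0).to(device)

with torch.no_grad():
    logits_per_image, _ = model(image, text)  # image-text similarity logits
    probs = logits_per_image.softmax(dim=-1)

print(groups[probs.argmax().item()])
```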
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript investigates bias redistribution in machine unlearning for CLIP vision-language models on the CelebA dataset under zero-shot classification. It evaluates three unlearning methods (Prompt Erasure, Prompt Reweighting, Refusal Vector) across ViT variants and intersectional age-gender groups, reporting that unlearning does not eliminate bias but redistributes it primarily along gender boundaries. In particular, forgetting the dominant Young Female group transfers performance to Old Female, which the authors interpret as evidence of a gender-dominant structure in CLIP's embedding space. Metrics include per-group accuracy shifts, demographic parity gaps, and a redistribution score.
Significance. If the gender-dominant redistribution pattern is shown to be intrinsic to embedding geometry rather than method- or data-specific, the result would be significant for fairness research in unlearning and multimodal models. The work merits credit for its consistent empirical patterns across three model scales and three distinct unlearning methods, which strengthens confidence in the observed transfers. However, the broader implications for embedding geometry and method limitations remain provisional without further controls.
major comments (2)
- [Results] The central claim that unlearning reveals a 'gender-dominant structure in CLIP's embedding space' (abstract) is load-bearing on the interpretation of the Young Female to Old Female performance transfer. This interpretation is not yet supported by ablations that would isolate embedding geometry from CelebA label correlations or from artifacts introduced by the three chosen unlearning methods.
- [Evaluation Metrics] The redistribution score and accuracy-shift results are presented without reported statistical significance tests, exact formula for the score, or controls for confounding factors in the age-gender intersections (abstract and implied experimental sections). These omissions directly affect the strength of the redistribution-along-gender conclusion.
minor comments (3)
- Details on data splits, training/validation protocols, and exact hyperparameter choices for the three unlearning methods are missing and would improve reproducibility.
- The manuscript would benefit from a table or figure explicitly showing the redistribution score for all group pairs and all methods to allow direct comparison of gender vs. age effects.
- Clarify whether the demographic parity gaps are computed on the full test set or only on retained groups after unlearning.
Simulated Author's Rebuttal
Thank you for your thorough review and valuable feedback on our manuscript. We have carefully considered each of the major comments and provide point-by-point responses below. We outline the revisions we intend to make to address the concerns raised.
point-by-point responses
Referee: [Results] The central claim that unlearning reveals a 'gender-dominant structure in CLIP's embedding space' (abstract) is load-bearing on the interpretation of the Young Female to Old Female performance transfer. This interpretation is not yet supported by ablations that would isolate embedding geometry from CelebA label correlations or from artifacts introduced by the three chosen unlearning methods.
Authors: We thank the referee for highlighting this important aspect of our interpretation. The observed performance transfer from Young Female to Old Female is consistent across three different unlearning methods and three model scales, which we believe provides evidence against method-specific artifacts. Regarding CelebA label correlations, the fact that the transfer occurs within the same gender but across age groups, rather than to male groups, supports our gender-dominant structure hypothesis. However, we acknowledge that explicit ablations isolating embedding geometry would further strengthen this claim. In the revised manuscript, we will add a new subsection discussing potential confounding factors from dataset correlations and include an analysis of embedding similarities between group prototypes to better isolate the geometric effects.
revision: partial
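A minimal sketch of the promised prototype-similarity analysis, assuming CLIP image embeddings per intersectional group (random stand-ins below); gender-dominant geometry predicts higher same-gender than same-age similarity:

```python
import torch
import torch.nn.functional as F

def prototype(embs: torch.Tensor) -> torch.Tensor:
    """Group prototype: L2-normalized mean embedding."""
    return F.normalize(embs.mean(dim=0), dim=0)

# Stand-in features; replace with real CLIP embeddings per group.
groups = {g: torch.randn(200, 512) for g in
          ["young_female", "old_female", "young_male", "old_male"]}
protos = {g: prototype(e) for g, e in groups.items()}

same_gender = torch.dot(protos["young_female"], protos["old_female"])
same_age    = torch.dot(protos["young_female"], protos["young_male"])
print(f"same-gender sim: {same_gender.item():.3f}, "
      f"same-age sim: {same_age.item():.3f}")
```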
Referee: [Evaluation Metrics] The redistribution score and accuracy-shift results are presented without reported statistical significance tests, exact formula for the score, or controls for confounding factors in the age-gender intersections (abstract and implied experimental sections). These omissions directly affect the strength of the redistribution-along-gender conclusion.
Authors: We agree that including statistical significance tests and the exact formula for the redistribution score will improve the rigor of our results. We will explicitly provide the formula for the redistribution score in the main text. We will also report p-values using paired statistical tests for accuracy shifts across runs and add controls by analyzing the correlation between group transfers and label co-occurrences in CelebA. These changes will be incorporated in the revised experimental and results sections.
revision: yes
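A minimal sketch of the paired test the response commits to, assuming per-seed accuracies for one group before and after unlearning (the numbers are placeholders); a Wilcoxon signed-rank test avoids a normality assumption:

```python
import numpy as np
from scipy.stats import wilcoxon

# Hypothetical Old Female accuracy across seeds, before and after
# unlearning Young Female (replace with measured runs).
acc_before = np.array([0.61, 0.63, 0.60, 0.62, 0.64, 0.61, 0.62, 0.60])
acc_after  = np.array([0.80, 0.82, 0.79, 0.81, 0.83, 0.80, 0.81, 0.79])

# Paired, one-sided: did accuracy on the retained group increase?
stat, p = wilcoxon(acc_after, acc_before, alternative="greater")
print(f"W={stat}, p={p:.4f}")
```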
Circularity Check
No circularity: purely empirical measurements with no derivations or self-referential quantities
full rationale
The paper conducts an empirical study evaluating three unlearning methods (Prompt Erasure, Prompt Reweighting, Refusal Vector) on CLIP ViT variants using CelebA intersectional groups. It reports per-group accuracy shifts, demographic parity gaps, and a redistribution score. No equations, derivations, fitted parameters, or self-citations are invoked as load-bearing steps in any claimed result. The central observation (performance transfer from Young Female to Old Female) is presented as a measured outcome rather than a quantity forced by definition or prior self-citation. The work is self-contained against external benchmarks and does not reduce any prediction to its own inputs.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The three evaluated unlearning methods are representative of current approaches for studying bias effects.