ICED: Concept-level Machine Unlearning via Interpretable Concept Decomposition
Pith reviewed 2026-05-19 16:46 UTC · model grok-4.3
The pith
Decomposing visual representations into semantic concepts allows selective suppression of target knowledge in vision-language models.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By building a task-specific concept vocabulary from the forgetting set and decomposing visual representations into sparse nonnegative combinations of those concepts, unlearning reduces to concept-level optimization that selectively suppresses target concepts while preserving intra-instance non-target semantics and global cross-modal knowledge.
What carries the argument
Interpretable concept decomposition, in which visual representations are expressed as sparse nonnegative linear combinations of semantic concepts drawn from a multimodal LLM, serving as the explicit interface for targeted suppression.
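As a minimal sketch of what such a decomposition could look like in practice, the snippet below fits sparse nonnegative weights over a small concept dictionary with projected proximal gradient descent. The solver, the function name `decompose`, the parameter `lam`, and the synthetic embeddings are all assumptions made for illustration; the paper's actual procedure is not specified on this page.

```python
import numpy as np

def decompose(z, concepts, lam=0.01, steps=500):
    """Approximate z as a sparse nonnegative combination of concept rows.

    Minimizes 0.5 * ||z - w @ concepts||^2 + lam * sum(w) subject to w >= 0.
    The solver (projected proximal gradient) is an assumed choice.
    """
    w = np.zeros(concepts.shape[0])
    # Step size from the Lipschitz constant of the smooth term's gradient.
    step = 1.0 / np.linalg.norm(concepts @ concepts.T, 2)
    for _ in range(steps):
        grad = (w @ concepts - z) @ concepts.T
        # Gradient step, L1 shrinkage, then projection onto w >= 0.
        w = np.maximum(w - step * (grad + lam), 0.0)
    return w

# Synthetic stand-ins for concept text embeddings (hypothetical data).
rng = np.random.default_rng(0)
concepts = rng.normal(size=(8, 32))
concepts /= np.linalg.norm(concepts, axis=1, keepdims=True)

# A visual embedding built from two concepts only.
true_w = np.zeros(8)
true_w[[1, 5]] = [0.7, 0.3]
z = true_w @ concepts

w = decompose(z, concepts)
print(np.round(w, 3))
```

The nonnegativity constraint is what makes each weight readable as "how much of this concept is present", which is the property the suppression step relies on.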
If this is right
- Target concepts are removed more thoroughly than with image- or instance-level unlearning.
- Non-target semantics inside the same image stay largely unchanged after the operation.
- Overall model performance on unrelated tasks remains competitive with prior VLM unlearning techniques.
- Both in-domain and out-of-domain forgetting scenarios show gains from operating at the concept level.
Where Pith is reading between the lines
- The same decomposition approach could be adapted to text-only language models for concept-level forgetting.
- Dynamic requests to forget new concepts might be handled by incrementally updating the vocabulary without retraining the full model.
- If concept separability is imperfect in some domains, combining this method with small amounts of instance-level regularization could improve robustness.
Load-bearing premise
Visual features can be accurately expressed as sparse sums of distinct semantic concepts identified by a multimodal model from the examples to be forgotten.
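One standard way to state this premise formally is shown below. The notation is assumed rather than taken from the paper: z is a visual embedding, C is a matrix whose columns are concept embeddings, w is the weight vector, and λ controls sparsity.

```latex
\hat{w}(z) \;=\; \arg\min_{w \ge 0} \; \tfrac{1}{2}\,\lVert z - C w \rVert_2^2 \;+\; \lambda \lVert w \rVert_1,
\qquad z \;\approx\; C\,\hat{w}(z).
```

Under this reading, the premise holds to the extent that the residual z − Cŵ(z) is small and the entries of ŵ(z) indexed by target concepts carry the target content without absorbing contextual semantics.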
What would settle it
After running the concept suppression step, feed the model images that contain only the target concept and check whether its outputs still include descriptions or predictions tied to that concept; persistent presence would falsify the claim of selective forgetting.
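A hedged sketch of how this check might be scored at the embedding level follows. It uses cosine similarity to a target concept embedding as a proxy for "outputs still tied to that concept". The function name `target_retention_rate`, the threshold of 0.25, and the synthetic embeddings are illustrative assumptions; a real test would query the unlearned model's actual outputs.

```python
import numpy as np

def target_retention_rate(image_embs, target_emb, threshold=0.25):
    """Fraction of images whose cosine similarity to the target exceeds threshold."""
    imgs = image_embs / np.linalg.norm(image_embs, axis=1, keepdims=True)
    tgt = target_emb / np.linalg.norm(target_emb)
    return float(np.mean(imgs @ tgt > threshold))

rng = np.random.default_rng(0)
target = rng.normal(size=64)

# Stand-in for the original model: target-only images embed near the target.
before = target + 0.3 * rng.normal(size=(100, 64))

# Stand-in for an idealized unlearned model: the target direction is projected out.
unit = target / np.linalg.norm(target)
after = before - np.outer(before @ unit, unit)

rate_before = target_retention_rate(before, target)
rate_after = target_retention_rate(after, target)
print(rate_before, rate_after)
```

A retention rate that stays high after unlearning would count as evidence against selective forgetting in the sense described above.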
Original abstract
Machine unlearning in Vision-Language Models (VLMs) is typically performed at the image or instance level, making it difficult to precisely remove target knowledge without affecting unrelated semantics. This issue is especially pronounced since a single image often contains multiple entangled concepts, including both target concepts to be forgotten and contextual information that should be preserved. In this paper, we propose an interpretable concept-level unlearning framework for VLMs, which constructs a compact task-specific concept vocabulary from the forgetting set using a multimodal large language model. In addition to modality alignment, visual representations are decomposed into sparse, nonnegative combinations of semantic concepts, providing an explicit interface for fine-grained knowledge manipulation. Based on this decomposition, our method formulates unlearning as concept-level optimization, where target concepts are selectively suppressed while intra-instance non-target semantics and global cross-modal knowledge are preserved. Extensive experiments across both in-domain and out-of-domain forgetting settings demonstrate that our method enables more comprehensive target forgetting, better preserves non-target knowledge within the same image, and maintains competitive model utility compared with existing VLM unlearning methods.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes ICED, a concept-level unlearning framework for Vision-Language Models. It uses a multimodal LLM to extract a compact concept vocabulary from the forgetting set, decomposes visual representations into sparse nonnegative linear combinations of these concepts, and performs unlearning via concept-level optimization that suppresses target concepts while aiming to preserve intra-image non-target semantics and overall model utility. Experiments are claimed to show superior target forgetting and preservation compared to existing VLM unlearning methods in both in-domain and out-of-domain settings.
Significance. If the core decomposition step faithfully isolates target concepts without residual entanglement, the method could offer a more interpretable and precise alternative to instance-level unlearning, addressing a key limitation in current VLM safety techniques. The LLM-assisted concept extraction and nonnegative sparse decomposition represent a potentially useful interface for fine-grained control, with possible broader implications for controllable forgetting in multimodal models.
major comments (2)
- [§3.2] §3.2 (Decomposition procedure): The central claim that visual representations decompose into sparse, nonnegative combinations of LLM-extracted concepts to enable selective suppression rests on the fidelity of this step. No quantitative validation (e.g., reconstruction error, concept isolation metrics, or ablation on sparsity parameter) is referenced to confirm that target concepts are cleanly separated from contextual semantics in entangled images; if the decomposition leaks non-target information, the reported gains in comprehensive forgetting and intra-image preservation cannot be attributed to the concept-level interface.
- [§4] §4 (Experiments): The abstract asserts 'extensive experiments' demonstrating more comprehensive forgetting and better preservation than baselines, yet no specific quantitative results, baseline comparisons, or ablation studies on the decomposition are cited in the provided summary. Without these (e.g., forgetting accuracy deltas or preservation scores in Table 2), the superiority claim over instance-level methods remains unsubstantiated and load-bearing for the paper's contribution.
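The kind of quantitative validation requested in the first comment could be computed along the following lines. Both metrics and their names (`relative_reconstruction_error`, `target_mass_fraction`) are hypothetical illustrations and are not defined in the paper; the toy concept matrix is an identity for readability.

```python
import numpy as np

def relative_reconstruction_error(z, concepts, w):
    """||z - w @ concepts|| / ||z||; lower means a more faithful decomposition."""
    return float(np.linalg.norm(z - w @ concepts) / np.linalg.norm(z))

def target_mass_fraction(w, target_idx):
    """Share of the total nonnegative weight assigned to the target concepts."""
    total = float(np.sum(w))
    return float(np.sum(w[target_idx]) / total) if total > 0 else 0.0

concepts = np.eye(4)                      # toy orthonormal concept embeddings
w = np.array([0.6, 0.0, 0.2, 0.0])        # weights from some decomposition
z = w @ concepts + np.array([0.0, 0.1, 0.0, 0.0])  # small unexplained residual

err = relative_reconstruction_error(z, concepts, w)
frac = target_mass_fraction(w, [0])
print(err, frac)
```

Reporting numbers of this kind per image, for target-only and entangled images separately, would let a reader judge how much non-target information leaks into the target weights.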
minor comments (2)
- [§3] Notation for the nonnegative sparse decomposition (e.g., the exact form of the optimization objective combining reconstruction and sparsity) should be clarified with an explicit equation to aid reproducibility.
- The paper should include a limitations section discussing potential failure modes of the multimodal LLM concept extraction, such as incomplete coverage of target concepts.
Simulated Author's Rebuttal
We are grateful to the referee for their detailed and insightful comments, which have helped us improve the clarity and rigor of our manuscript on ICED. We address each major comment below.
- Referee: [§3.2] §3.2 (Decomposition procedure): The central claim that visual representations decompose into sparse, nonnegative combinations of LLM-extracted concepts to enable selective suppression rests on the fidelity of this step. No quantitative validation (e.g., reconstruction error, concept isolation metrics, or ablation on sparsity parameter) is referenced to confirm that target concepts are cleanly separated from contextual semantics in entangled images; if the decomposition leaks non-target information, the reported gains in comprehensive forgetting and intra-image preservation cannot be attributed to the concept-level interface.
  Authors: We thank the referee for highlighting this important point. While the original manuscript includes ablations on the sparsity parameter and qualitative visualizations demonstrating the decomposition's effectiveness in isolating concepts (see Section 3.2 and Appendix B), we acknowledge that explicit quantitative metrics such as reconstruction error and concept isolation scores were not reported. In the revised manuscript, we have added these quantitative validations in a new subsection of Section 3.2, including metrics showing low reconstruction errors and high concept isolation for target concepts with minimal leakage. This supports the attribution of performance gains to the concept-level interface.
  Revision: yes
- Referee: [§4] §4 (Experiments): The abstract asserts 'extensive experiments' demonstrating more comprehensive forgetting and better preservation than baselines, yet no specific quantitative results, baseline comparisons, or ablation studies on the decomposition are cited in the provided summary. Without these (e.g., forgetting accuracy deltas or preservation scores in Table 2), the superiority claim over instance-level methods remains unsubstantiated and load-bearing for the paper's contribution.
  Authors: We note that the referee's summary provides a high-level overview of the paper. The full manuscript details the extensive experiments in Section 4, with specific quantitative results, baseline comparisons, and ablation studies presented in Tables 2 and 3, as well as in Section 4.3. To improve clarity, we have revised the abstract and the opening of Section 4 to more explicitly cite these quantitative findings and tables.
  Revision: partial
Circularity Check
No circularity: method is an independent optimization procedure with external validation
The paper introduces a concept-level unlearning framework that constructs a vocabulary via multimodal LLM and performs sparse nonnegative decomposition followed by selective suppression optimization. No equations or steps in the provided description reduce the claimed forgetting performance or preservation properties to fitted parameters by construction, self-citations that bear the central load, or renamings of known results. The derivation chain consists of a proposed procedure whose correctness is assessed via experiments rather than tautological re-expression of inputs. This is the expected self-contained outcome for a methods paper without load-bearing self-referential definitions.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: Visual representations can be decomposed into sparse, nonnegative combinations of semantic concepts.