CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization
Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3
The pith
CiPO removes undesired knowledge from both intermediate reasoning steps and final answers in large reasoning models by iteratively optimizing preferences toward valid counterfactual traces while keeping overall reasoning performance intact.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.
Load-bearing premise
That instructing the model to generate logically valid counterfactual reasoning traces and iteratively increasing their discrepancy from the original model will achieve complete unlearning of the target knowledge without side effects on unrelated reasoning tasks.
Original abstract
Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.
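The loop described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: the abstract does not name the preference objective, so a DPO-style pairwise loss is assumed here, and the helper names (`PreferencePair`, `run_rounds`, `make_counterfactual`, `sample_current`, `tune`) are hypothetical. Treating the current model's own trace as the rejected sample is also an assumption.

```python
import math
from dataclasses import dataclass
from typing import Callable, List, Sequence


@dataclass
class PreferencePair:
    prompt: str
    chosen: str    # counterfactual trace that ends in the unlearning target answer
    rejected: str  # assumption: the current model's own trace for the same prompt


def dpo_loss(logp_chosen: float, logp_rejected: float,
             ref_logp_chosen: float, ref_logp_rejected: float,
             beta: float = 0.1) -> float:
    """DPO-style loss for one pair: -log(sigmoid(margin)) (assumed objective)."""
    margin = beta * ((logp_chosen - ref_logp_chosen)
                     - (logp_rejected - ref_logp_rejected))
    # Numerically stable softplus(-margin), which equals -log(sigmoid(margin)).
    return max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))


def run_rounds(prompts: Sequence[str],
               targets: Sequence[str],
               make_counterfactual: Callable[[str, str], str],
               sample_current: Callable[[str], str],
               tune: Callable[[List[PreferencePair]], None],
               rounds: int = 3) -> List[List[PreferencePair]]:
    """Rebuild the preference data each round, then tune on it (hypothetical loop)."""
    history = []
    for _ in range(rounds):
        # Pairs are regenerated every round so they track the updated model.
        pairs = [PreferencePair(p, make_counterfactual(p, t), sample_current(p))
                 for p, t in zip(prompts, targets)]
        tune(pairs)  # hypothetical preference-tuning step on the fresh pairs
        history.append(pairs)
    return history
```

In a real setup the three callables would wrap model generation and a preference-tuning step; here they are left abstract so the sketch stays independent of any particular training library.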
Editorial analysis
A structured set of objections, weighed in public.
Axiom & Free-Parameter Ledger
axioms (2)
- domain assumption: LRMs can be instructed to produce logically valid counterfactual reasoning traces for any given unlearning target.
- ad hoc to paper: Iterative updates to preference data will monotonically increase discrepancy from the original model without introducing new unwanted behaviors.
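The second assumption is checkable in principle if a per-round discrepancy score is logged. The sketch below assumes such a score already exists as a list of floats, one per round. How the score is computed is not specified on this page, and the function names are hypothetical.

```python
from typing import Optional, Sequence


def is_non_decreasing(scores: Sequence[float], tol: float = 0.0) -> bool:
    """True if each round's discrepancy is at least the previous one (within tol)."""
    return all(b >= a - tol for a, b in zip(scores, scores[1:]))


def first_drop(scores: Sequence[float], tol: float = 0.0) -> Optional[int]:
    """Index of the first round whose score falls below the previous round, else None."""
    for i, (a, b) in enumerate(zip(scores, scores[1:]), start=1):
        if b < a - tol:
            return i
    return None
```

A drop flagged by `first_drop` would not by itself show new unwanted behaviors; it only indicates that the monotonicity part of the assumption failed for that run.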