pith. machine review for the scientific record.

arxiv: 2604.15847 · v1 · submitted 2026-04-17 · 💻 cs.CL

CiPO: Counterfactual Unlearning for Large Reasoning Models through Iterative Preference Optimization

Pith reviewed 2026-05-10 08:46 UTC · model grok-4.3

classification 💻 cs.CL
keywords reasoning · unlearning · cipo · counterfactual · lrms · preference · iterative · large

The pith

CiPO removes undesired knowledge from both intermediate reasoning steps and final answers in large reasoning models by iteratively optimizing preferences toward valid counterfactual traces while keeping overall reasoning performance intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large reasoning models solve hard problems by producing long chains of thought before giving an answer. The problem is that these models sometimes remember private or copyrighted facts they should forget. Existing unlearning methods either leave traces of the unwanted knowledge inside the reasoning chain or damage the model's ability to think clearly. CiPO tries to fix this by first asking the model to create a different but logically sound reasoning path that leads to a different answer. It then uses preference optimization to make the model prefer this new path over the original one. The process repeats, each time updating the training examples so the gap between the original and new behavior grows. Experiments on standard benchmarks reportedly show that the unwanted facts disappear from both the middle steps and the final output, yet the model still solves other problems at the same level as before. The approach treats unlearning as a targeted change to the reasoning process rather than a blunt removal of facts.
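
The preference step described above can be sketched as follows. This is a minimal illustration assuming a DPO-style objective (Rafailov et al., 2023, which appears in the paper's reference list); the abstract does not state which preference loss CiPO actually uses. The function name `dpo_pair_loss`, the `beta` value, and the log-probabilities are hypothetical.

```python
import math

def dpo_pair_loss(policy_logp_chosen, policy_logp_rejected,
                  ref_logp_chosen, ref_logp_rejected, beta=0.1):
    """DPO-style loss for one preference pair (assumed objective, not confirmed by the source).

    chosen   = counterfactual reasoning trace plus counterfactual answer
    rejected = original trace that still contains the knowledge to forget
    Inputs are sequence log-probabilities under the trained policy and a frozen reference.
    """
    # How much more the policy favours each trace than the reference does.
    chosen_gain = policy_logp_chosen - ref_logp_chosen
    rejected_gain = policy_logp_rejected - ref_logp_rejected
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))

# Hypothetical log-probabilities, for illustration only.
loss_before = dpo_pair_loss(-5.0, -5.0, -5.0, -5.0)  # policy equals reference
loss_after = dpo_pair_loss(-2.0, -8.0, -5.0, -5.0)   # policy now prefers the counterfactual
```

The loss falls as the policy assigns relatively more probability to the counterfactual trace than to the original one, which is the direction of change the summary describes.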

Core claim

Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.

Load-bearing premise

That instructing the model to generate logically valid counterfactual reasoning traces and iteratively increasing their discrepancy from the original model will achieve complete unlearning of the target knowledge without side effects on unrelated reasoning tasks.

Figures

Figures reproduced from arXiv: 2604.15847 by Junyi Li, Ningning Ding, Yongqiang Chen.

Figure 1: Difference between LLMs and LRMs. … scale of human data, however, the immense capacity of LLMs also leads them to memorize and potentially regenerate sensitive, private, or copyrighted information from the training data (Karamolegkou et al., 2023; Patil et al., 2024; Li et al., 2024). This raises significant privacy and ethical concerns, necessitating methods to control model knowledge post-training (Liu …
Figure 2: Comparison of outputs on the forget set from previous unlearning baselines (R …
Figure 3: Causal Graph for LRM Unlearning. Causal view of LRM unlearning. Motivated by Liu et al. (2024) and Bao et al. (2025), we first build an explicit causal graph for the LRM unlearning problem, given in …
Figure 4: CiPO framework with counterfactual generator (left) and iterative preference optimization (right).
Figure 5: Ablation Study on R-TOFU Forget01 cases.
Figure 6: Prompts used as Counterfactual Answer Generator.
Figure 7: Prompts used as Counterfactual CoT Generator.
Figure 8: Generated Example of counterfactual answer.
Figure 9: Examples of R-TOFU forget set. Counterfactual Set Example. Question: What is the full name of the male author born in Taipei, Taiwan on the 15th of April, 1992? CoT: Alright, let's figure out the full name of the male author born in Taipei, Taiwan on April 15th, 1992. I need to use my domain knowledge to reason this out. First, I know that Taipei is the capital city of Taiwan, so the person in question is …
Figure 10: Examples of counterfactual set generated based on R-TOFU forget set.
Figure 11: Examples of real-world cases.
Figure 12: Prompts used to evaluate answer correctness.
Figure 13: Prompts used to get the CoT Leakage Score.
Original abstract

Machine unlearning has gained increasing attention in recent years, as a promising technique to selectively remove unwanted privacy or copyrighted information from Large Language Models that are trained on a massive scale of human data. However, the emergence of Large Reasoning Models (LRMs), which emphasize long chain-of-thought (CoT) reasoning to address complex questions, presents a dilemma to unlearning: existing methods either struggle to completely eliminate undesired knowledge from the CoT traces or degrade the reasoning performances due to the interference with the reasoning process. To this end, we introduce Counterfactual Unlearning through iterative Preference Optimization (CiPO), a novel framework that redefines unlearning as the targeted intervention of the CoT reasoning in LRMs. More specifically, given a desired unlearning target answer, CiPO instructs LRMs to generate a logically valid counterfactual reasoning trace for preference tuning. As the LRM adjusts to the counterfactual trace, CiPO iteratively updates the preference learning data to increase the discrepancy from the original model. This iterative loop ensures both desirable unlearning and smooth optimization, effectively mitigating the dilemma. Experiments on challenging benchmarks demonstrate that CiPO excels at unlearning, completely removing knowledge from both the intermediate CoT steps and the final answer, while preserving the reasoning abilities of LRMs.
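
The iterative loop in the abstract can be pictured with a toy sketch. Everything here is an assumption made for illustration: the "model" is just a dictionary mapping each forget-set question to the probability of emitting the original answer, and `toy_build_pairs` and `toy_train` are hypothetical stand-ins for counterfactual generation and preference tuning. The only point shown is the control flow, in which preference data is rebuilt from the current model at every round.

```python
from typing import Callable, Dict, List, Tuple

# (question, chosen counterfactual trace, rejected original-style trace)
Pair = Tuple[str, str, str]
ToyModel = Dict[str, float]  # question -> probability of the original answer (toy stand-in)

def run_iterative_unlearning(model: ToyModel,
                             questions: List[str],
                             build_pairs: Callable[[ToyModel, List[str]], List[Pair]],
                             train_on_pairs: Callable[[ToyModel, List[Pair]], ToyModel],
                             rounds: int) -> ToyModel:
    for _ in range(rounds):
        # Preference data is refreshed from the current model, not the initial one.
        pairs = build_pairs(model, questions)
        if not pairs:  # nothing left to unlearn
            break
        model = train_on_pairs(model, pairs)
    return model

def toy_build_pairs(model: ToyModel, questions: List[str]) -> List[Pair]:
    # Hypothetical: only questions the model still answers "correctly" get a pair.
    return [(q, f"counterfactual trace for {q}", f"original trace for {q}")
            for q in questions if model[q] > 0.05]

def toy_train(model: ToyModel, pairs: List[Pair]) -> ToyModel:
    # Hypothetical update: each round halves the probability of the original answer.
    updated = dict(model)
    for question, _chosen, _rejected in pairs:
        updated[question] *= 0.5
    return updated

toy_model = {"q1": 0.9, "q2": 0.6}
final_model = run_iterative_unlearning(toy_model, ["q1", "q2"],
                                       toy_build_pairs, toy_train, rounds=10)
```

Passing the pair builder and trainer as callables keeps the loop independent of any particular generator or loss, which matters here because the abstract specifies neither.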

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Axiom & Free-Parameter Ledger

0 free parameters · 2 axioms · 0 invented entities

Abstract-only review provides no explicit free parameters, axioms, or invented entities; the method implicitly assumes that counterfactual traces can be generated on demand and that preference optimization can be iterated without destabilizing the model.

axioms (2)
  • domain assumption: LRMs can be instructed to produce logically valid counterfactual reasoning traces for any given unlearning target.
    Stated in the method description as the basis for preference tuning data.
  • ad hoc to paper: Iterative updates to preference data will monotonically increase discrepancy from the original model without introducing new unwanted behaviors.
    Central to the iterative loop claimed to ensure both unlearning and smooth optimization.
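
The second axiom is checkable in principle. Below is a minimal sketch, assuming discrepancy is measured as the KL divergence between the original model's answer distribution and the distribution after each round; the abstract does not say which discrepancy measure the paper uses, and the distributions are made-up toy values.

```python
import math

def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two categorical distributions given as equal-length lists."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)

def is_monotone_increasing(values):
    """True if each value is at least as large as the one before it."""
    return all(later >= earlier for earlier, later in zip(values, values[1:]))

# Toy distributions over [original answer, counterfactual answer]; illustrative only.
original = [0.9, 0.1]
rounds = [[0.7, 0.3], [0.4, 0.6], [0.1, 0.9]]

discrepancies = [kl_divergence(original, current) for current in rounds]
axiom_holds = is_monotone_increasing(discrepancies)
```

A check of this kind would flag a round in which the model drifts back toward its original behaviour, though it says nothing about the "no new unwanted behaviors" half of the axiom.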

pith-pipeline@v0.9.0 · 5524 in / 1495 out tokens · 28600 ms · 2026-05-10T08:46:20.471958+00:00 · methodology


Reference graph

Works this paper leans on

46 extracted references · 20 canonical work pages · 5 internal anchors

  1. [1]

    Guangsheng Bao, Hongbo Zhang, Cunxiang Wang, Linyi Yang, and Yue Zhang. 2025. https://aclanthology.org/2025.coling-main.524/ How likely do LLMs with CoT mimic human reasoning? In Proceedings of the 31st International Conference on Computational Linguistics, pages 7831--7850, Abu Dhabi, UAE. Association for Computational Linguistics

  2. [2]

    Yinzhi Cao and Junfeng Yang. 2015. Towards making systems forget with machine unlearning. In 2015 IEEE symposium on security and privacy, pages 463--480. IEEE

  3. [3]

    Kongyang Chen, Zixin Wang, Bing Mi, Waixi Liu, Shaowei Wang, Xiaojun Ren, and Jiaxing Shen. 2024. Machine unlearning in large language models. arXiv preprint arXiv:2404.16841

  4. [4]

    DeepSeek-AI. 2024. https://arxiv.org/abs/2412.19437 Deepseek-v3 technical report. Preprint, arXiv:2412.19437

  5. [5]

    DeepSeek-AI. 2025. https://arxiv.org/abs/2501.12948 Deepseek-r1: Incentivizing reasoning capability in llms via reinforcement learning. Preprint, arXiv:2501.12948

  6. [6]

    Vineeth Dorna, Anmol Reddy Mekala, Wenlong Zhao, Andrew McCallum, J Zico Kolter, Zachary Chase Lipton, and Pratyush Maini. 2025. https://openreview.net/forum?id=Gy67Zh5X1i Openunlearning: Accelerating LLM unlearning via unified benchmarking of methods and metrics. In The Thirty-ninth Annual Conference on Neural Information Processing Systems Datasets and...

  7. [7]

    Ronen Eldan and Mark Russinovich. 2023. https://arxiv.org/abs/2310.02238 Who's harry potter? approximate unlearning in llms. Preprint, arXiv:2310.02238

  8. [8]

    Chongyu Fan, Jiancheng Liu, Licong Lin, Jinghan Jia, Ruiqi Zhang, Song Mei, and Sijia Liu. 2025. https://openreview.net/forum?id=JbvSQm5h1l Simplicity prevails: Rethinking negative preference optimization for LLM unlearning. In The Thirty-ninth Annual Conference on Neural Information Processing Systems

  9. [9]

    Leo Gao, Jonathan Tow, Baber Abbasi, Stella Biderman, Sid Black, Anthony DiPofi, Charles Foster, Laurence Golding, Jeffrey Hsu, Alain Le Noac'h, Haonan Li, Kyle McDonell, Niklas Muennighoff, Chris Ociepa, Jason Phang, Laria Reynolds, Hailey Schoelkopf, Aviya Skowron, Lintang Sutawika, and 5 others. 2024. https://doi.org/10.5281/zenodo.12608602 The languag...

  10. [10]

    Aaron Grattafiori, Abhimanyu Dubey, Abhinav Jauhri, Abhinav Pandey, Abhishek Kadian, Ahmad Al-Dahle, Aiesha Letman, Akhil Mathur, Alan Schelten, Alex Vaughan, and 1 others. 2024. The llama 3 herd of models. arXiv preprint arXiv:2407.21783

  11. [11]

    Tommaso Green, Martin Gubri, Haritz Puerto, Sangdoo Yun, and Seong Joon Oh. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.1347 Leaky thoughts: Large reasoning models are not private thinkers. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 26507--26529, Suzhou, China. Association for Computational Linguistics

  12. [12]

    Shangmin Guo, Biao Zhang, Tianlin Liu, Tianqi Liu, Misha Khalman, Felipe Llinares, Alexandre Rame, Thomas Mesnard, Yao Zhao, Bilal Piot, and 1 others. 2024. Direct language model alignment from online ai feedback. arXiv preprint arXiv:2402.04792

  13. [13]

    Dang Huu-Tien, Tin Pham, Hoang Thanh-Tung, and Naoya Inoue. 2025. https://doi.org/10.1609/aaai.v39i22.34544 On effects of steering latent representation for large language model unlearning. In Proceedings of the Thirty-Ninth AAAI Conference on Artificial Intelligence and Thirty-Seventh Conference on Innovative Applications of Artificial Intelligence and ...

  14. [14]

    Joel Jang, Dongkeun Yoon, Sohee Yang, Sungmin Cha, Moontae Lee, Lajanugen Logeswaran, and Minjoon Seo. 2023. https://doi.org/10.18653/v1/2023.acl-long.805 Knowledge unlearning for mitigating privacy risks in language models. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 14389--14...

  15. [15]

    Antonia Karamolegkou, Jiaang Li, Li Zhou, and Anders Søgaard. 2023. https://doi.org/10.18653/v1/2023.emnlp-main.458 Copyright violations and large language models. In Proceedings of the 2023 Conference on Empirical Methods in Natural Language Processing, pages 7403--7412, Singapore. Association for Computational Linguistics

  16. [16]

    Nathaniel Li, Alexander Pan, Anjali Gopal, Summer Yue, Daniel Berrios, Alice Gatti, Justin D. Li, Ann-Kathrin Dombrowski, Shashwat Goel, Gabriel Mukobi, Nathan Helm-Burger, Rassin Lababidi, Lennart Justen, Andrew Bo Liu, Michael Chen, Isabelle Barrass, Oliver Zhang, Xiaoyuan Zhu, Rishub Tamirisa, and 27 others. 2024. https://proceedings.mlr.press/v235/li2...

  17. [17]

    Sijia Liu, Yuanshun Yao, Jinghan Jia, Stephen Casper, Nathalie Baracaldo, Peter Hase, Yuguang Yao, Chris Yuhao Liu, Xiaojun Xu, Hang Li, and 1 others. 2025a. Rethinking machine unlearning for large language models. Nature Machine Intelligence, 7(2):181--194

  18. [18]

    Yujian Liu, Yang Zhang, Tommi Jaakkola, and Shiyu Chang. 2024. https://doi.org/10.18653/v1/2024.emnlp-main.495 Revisiting who's harry potter: Towards targeted unlearning from a causal intervention perspective. In Proceedings of the 2024 Conference on Empirical Methods in Natural Language Processing, pages 8708--8731, Miami, Florida, USA. Association fo...

  19. [19]

    Zhenhua Liu, Tong Zhu, Chuanyuan Tan, and Wenliang Chen. 2025b. https://aclanthology.org/2025.coling-main.114/ Learning to refuse: Towards mitigating privacy risks in LLMs. In Proceedings of the 31st International Conference on Computational Linguistics, pages 1683--1698, Abu Dhabi, UAE. Association for Computational Linguistics

  20. [20]

    Pratyush Maini, Zhili Feng, Avi Schwarzschild, Zachary Chase Lipton, and J Zico Kolter. 2024. https://openreview.net/forum?id=B41hNBoWLo TOFU: A task of fictitious unlearning for LLMs. In First Conference on Language Modeling

  21. [21]

    Alex Mallen, Akari Asai, Victor Zhong, Rajarshi Das, Daniel Khashabi, and Hannaneh Hajishirzi. 2023. https://doi.org/10.18653/v1/2023.acl-long.546 When not to trust language models: Investigating effectiveness of parametric and non-parametric memories. In Proceedings of the 61st Annual Meeting of the Association for Computational Linguistics (Volume 1: L...

  22. [22]

    Anmol Mekala, Vineeth Dorna, Shreya Dubey, Abhishek Lalwani, David Koleczek, Mukund Rungta, Sadid Hasan, and Elita Lobo. 2025. https://aclanthology.org/2025.coling-main.252/ Alternate preference optimization for unlearning factual knowledge in large language models. In Proceedings of the 31st International Conference on Computational Linguistics, pages 3...

  23. [23]

    Kevin Meng, David Bau, Alex J Andonian, and Yonatan Belinkov. 2022. https://openreview.net/forum?id=-h6WAS6eE4 Locating and editing factual associations in GPT. In Advances in Neural Information Processing Systems

  24. [24]

    Yu Meng, Mengzhou Xia, and Danqi Chen. 2024. https://openreview.net/forum?id=3Tzcot1LKb SimPO: Simple preference optimization with a reference-free reward. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  25. [25]

    OpenAI. 2023. https://doi.org/10.48550/ARXIV.2303.08774 GPT-4 technical report. CoRR, abs/2303.08774

  26. [26]

    OpenAI. 2024. https://arxiv.org/abs/2412.16720 Openai o1 system card. arXiv preprint arXiv:2412.16720

  27. [27]

    Richard Yuanzhe Pang, Weizhe Yuan, He He, Kyunghyun Cho, Sainbayar Sukhbaatar, and Jason E Weston. 2024. https://openreview.net/forum?id=4XIKfvNYvx Iterative reasoning preference optimization. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  28. [28]

    Vaidehi Patil, Peter Hase, and Mohit Bansal. 2024. https://openreview.net/forum?id=7erlRDoaV8 Can sensitive information be deleted from LLMs? objectives for defending against extraction attacks. In The Twelfth International Conference on Learning Representations

  29. [29]

    Rafael Rafailov, Archit Sharma, Eric Mitchell, Christopher D Manning, Stefano Ermon, and Chelsea Finn. 2023. https://openreview.net/forum?id=HPuSIXJaa9 Direct preference optimization: Your language model is secretly a reward model. In Thirty-seventh Conference on Neural Information Processing Systems

  30. [30]

    Weijia Shi, Jaechan Lee, Yangsibo Huang, Sadhika Malladi, Jieyu Zhao, Ari Holtzman, Daogao Liu, Luke Zettlemoyer, Noah A. Smith, and Chiyuan Zhang. 2025. https://openreview.net/forum?id=TArmA033BU MUSE: Machine unlearning six-way evaluation for language models. In The Thirteenth International Conference on Learning Representations

  31. [31]

    Yash Sinha, Murari Mandal, and Mohan Kankanhalli. 2025. https://openreview.net/forum?id=mNXCViKZbI UnSTAR: Unlearning with self-taught anti-sample reasoning for LLMs. Transactions on Machine Learning Research

  32. [32]

    Songjun Tu, Jiahao Lin, Xiangyu Tian, Qichao Zhang, Linjing Li, Yuqian Fu, Nan Xu, Wei He, Xiangyuan Lan, Dongmei Jiang, and Dongbin Zhao. 2025. https://openreview.net/forum?id=OgWh4J7bkT Enhancing LLM reasoning with iterative DPO: A comprehensive empirical investigation. In Second Conference on Language Modeling

  33. [33]

    Paul Voigt and Axel von dem Bussche. 2017. The EU General Data Protection Regulation (GDPR): A Practical Guide, 1st edition. Springer Publishing Company, Incorporated

  34. [34]

    Changsheng Wang, Chongyu Fan, Yihua Zhang, Jinghan Jia, Dennis Wei, Parikshit Ram, Nathalie Baracaldo, and Sijia Liu. 2025a. https://doi.org/10.18653/v1/2025.emnlp-main.220 Reasoning model unlearning: Forgetting traces, not just answers, while preserving reasoning skills. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Pr...

  35. [35]

    Yaxuan Wang, Jiaheng Wei, Chris Yuhao Liu, Jinlong Pang, Quan Liu, Ankit Shah, Yujia Bao, Yang Liu, and Wei Wei. 2025b. https://openreview.net/forum?id=6ESRicalFE LLM unlearning via loss adjustment with only forget data. In The Thirteenth International Conference on Learning Representations

  36. [36]

    Siyuan Wen, Meng Zhang, Yang Yang, and Ningning Ding. 2026. https://doi.org/10.1609/AAAI.V40I32.39895 Fedshard: Federated unlearning with efficiency fairness and performance fairness. In Fortieth AAAI Conference on Artificial Intelligence, Thirty-Eighth Conference on Innovative Applications of Artificial Intelligence, Sixteenth Symposium on Educational A...

  37. [37]

    Heng Xu, Tianqing Zhu, Lefeng Zhang, Wanlei Zhou, and Philip S. Yu. 2023. https://doi.org/10.1145/3603620 Machine unlearning: A survey. ACM Comput. Surv., 56(1)

  38. [38]

    Zhihe Yang, Xufang Luo, Dongqi Han, Yunjian Xu, and Dongsheng Li. 2025. Mitigating hallucinations in large vision-language models via dpo: On-policy data hold the key. In Proceedings of the Computer Vision and Pattern Recognition Conference, pages 10610--10620

  39. [39]

    Yuanshun Yao, Xiaojun Xu, and Yang Liu. 2024. https://openreview.net/forum?id=8Dy42ThoNe Large language model unlearning. In The Thirty-eighth Annual Conference on Neural Information Processing Systems

  40. [40]

    Sangyeon Yoon, Wonje Jeung, and Albert No. 2025. https://doi.org/10.18653/v1/2025.emnlp-main.265 R-TOFU: Unlearning in large reasoning models. In Proceedings of the 2025 Conference on Empirical Methods in Natural Language Processing, pages 5239--5258, Suzhou, China. Association for Computational Linguistics

  41. [41]

    Xiaojian Yuan, Tianyu Pang, Chao Du, Kejiang Chen, Weiming Zhang, and Min Lin. 2025. https://openreview.net/forum?id=Q1MHvGmhyT A closer look at machine unlearning for large language models. In The Thirteenth International Conference on Learning Representations

  42. [42]

    Dawen Zhang, Pamela Finckenberg-Broman, Thong Hoang, Shidong Pan, Zhenchang Xing, Mark Staples, and Xiwei Xu. 2025. Right to be forgotten in the era of large language models: Implications, challenges, and solutions. AI and Ethics, 5(3):2445--2454

  43. [43]

    Ruiqi Zhang, Licong Lin, Yu Bai, and Song Mei. 2024. https://openreview.net/forum?id=MXLBXjQkmb Negative preference optimization: From catastrophic collapse to effective unlearning. In First Conference on Language Modeling

  44. [44]

    Yukai Zhou, Jian Lou, Zhijie Huang, Zhan Qin, Sibei Yang, and Wenjie Wang. 2025. https://doi.org/10.18653/v1/2025.findings-acl.1294 Don't say no: Jailbreaking LLM by suppressing refusal. In Findings of the Association for Computational Linguistics: ACL 2025, pages 25224--25249, Vienna, Austria. Association for Computational Linguistics
