Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
Pith reviewed 2026-05-10 15:58 UTC · model grok-4.3
The pith
Geometric Unlearning lets LLMs forget specific private facts using only a handful of synthetic prompts while retaining general performance.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
The paper's central claim is that Geometric Unlearning operates directly on prompt-time planning states: it first distills a low-rank geometry of desired safe behavior from a small set of safe reference prompts, then applies projection-based alignment of hidden representations using synthetic in-context anchors, together with a teacher-distillation regularizer on non-target anchors, to suppress target information without access to the original training corpus.
What carries the argument
Geometric Unlearning (GU): extraction of a compact low-rank safe-behavior geometry from reference prompts followed by projection alignment of hidden planning states via synthetic anchors.
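The paper publishes no pseudocode, but the distill-then-project mechanism described above admits a compact sketch. Everything below (function names, the use of SVD, the choice of mean-centering, the toy dimensions) is an assumption for illustration, not the authors' implementation:

```python
import numpy as np

def distill_safe_geometry(safe_states: np.ndarray, rank: int) -> np.ndarray:
    """Distill a low-rank basis from hidden states collected on safe
    reference prompts. safe_states has shape (n_prompts, d_model)."""
    # Center the states, then keep the top-`rank` right singular vectors
    # as an orthonormal basis U for the safe subspace.
    centered = safe_states - safe_states.mean(axis=0, keepdims=True)
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return vt[:rank].T  # shape (d_model, rank)

def project_to_safe(h: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Align hidden states with the safe geometry: h' = h U U^T."""
    return h @ basis @ basis.T

# Toy usage: 32 safe prompts in a 64-dim hidden space, rank-4 geometry.
rng = np.random.default_rng(0)
safe = rng.normal(size=(32, 64))
U = distill_safe_geometry(safe, rank=4)
h = rng.normal(size=(5, 64))
h_aligned = project_to_safe(h, U)
```

The rank is the sketch's one free knob: too low and non-target behavior is clipped, too high and the projection stops suppressing anything.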
If this is right
- Strong suppression of target entities is achieved on ToFU and UnlearnPII benchmarks without original training data.
- Non-target performance remains largely intact when alignment uses only minimal synthetic prompts.
- Localized projection on hidden states avoids the broad gradient updates common in prior methods.
- A teacher-distillation regularizer on synthetic non-target anchors limits collateral drift during unlearning.
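The teacher-distillation regularizer in the last bullet is, in its standard form, a KL penalty tying the unlearned model's next-token distribution on non-target anchors to the frozen pre-unlearning teacher. A minimal numpy sketch, assuming logit-level access; the exact loss in the paper may differ:

```python
import numpy as np

def kl_distill_loss(teacher_logits: np.ndarray,
                    student_logits: np.ndarray) -> float:
    """KL(teacher || student), averaged over non-target anchor positions.
    Both arrays have shape (n_anchors, vocab)."""
    def softmax(x):
        z = x - x.max(axis=-1, keepdims=True)
        e = np.exp(z)
        return e / e.sum(axis=-1, keepdims=True)
    p = softmax(teacher_logits)  # frozen teacher distribution
    # Numerically stable log-softmax for the student.
    log_q = student_logits - student_logits.max(axis=-1, keepdims=True)
    log_q = log_q - np.log(np.exp(log_q).sum(axis=-1, keepdims=True))
    return float((p * (np.log(p) - log_q)).sum(axis=-1).mean())

rng = np.random.default_rng(1)
t = rng.normal(size=(8, 100))
# Identical models incur zero drift penalty; any real divergence is positive.
assert kl_distill_loss(t, t) < 1e-9
assert kl_distill_loss(t, 0.5 * t) > 0.0
```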
Where Pith is reading between the lines
- The same low-rank alignment idea could be tested on unlearning tasks in non-language models such as vision or multimodal systems.
- Organizations handling regulated data might adopt this approach to meet deletion requests without maintaining full training archives.
- If the safe geometry remains stable across model scales, the method could support repeated unlearning cycles on the same base model.
Load-bearing premise
A low-rank geometry distilled from a few safe prompts can be projected onto hidden states to suppress chosen target information without broad utility loss or access to the original data.
What would settle it
Run the method on a model in which target facts are deliberately entangled across many dimensions in the hidden states; if target suppression fails or non-target accuracy drops sharply, the geometric alignment approach does not hold.
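That settling experiment can be prototyped with a linear probe on synthetic hidden states: encode a target fact along one direction, remove it with an idealized projection, and check whether a probe still recovers it. The data model and all names here are assumptions for illustration only:

```python
import numpy as np

def probe_accuracy(states: np.ndarray, labels: np.ndarray) -> float:
    """Fit a least-squares linear probe and report its training accuracy.
    High accuracy after unlearning means the target signal survived."""
    X = np.hstack([states, np.ones((len(states), 1))])  # append bias column
    w, *_ = np.linalg.lstsq(X, labels * 2.0 - 1.0, rcond=None)
    return float(((X @ w > 0) == (labels > 0)).mean())

rng = np.random.default_rng(2)
d, n = 32, 200
direction = rng.normal(size=d)
direction /= np.linalg.norm(direction)
labels = rng.integers(0, 2, size=n)
# Synthetic hidden states with the target fact encoded along `direction`.
states = rng.normal(size=(n, d)) * 0.1 + np.outer(labels * 2.0 - 1.0, direction)
before = probe_accuracy(states, labels)
# Idealized unlearning step: remove the target direction outright.
states_proj = states - np.outer(states @ direction, direction)
after = probe_accuracy(states_proj, labels)
```

In the entangled-fact scenario the review describes, no single `direction` exists, and this is exactly where the probe would stay accurate after projection.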
Original abstract
As large language models (LLMs) are increasingly deployed in real-world systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information about a particular entity or topic while preserving the LLM's general utility. However, most existing LLM unlearning methods require access to the original training corpus and rely on output-level refusal tuning or broad gradient updates, creating a tension among unlearning strength, non-target preservation, and data availability. We propose Geometric Unlearning (GU), an approach that operates directly on the model's prompt-time planning states without access to the original training corpus. GU distills a compact, low-rank geometry of desired safe behavior from a small set of safe reference prompts, and uses lightweight anchor-in-context synthetic prompts to trigger localized, projection-based alignment of hidden planning representations to this safe geometry. A teacher-distillation regularizer on synthetic non-target anchors further reduces collateral drift. Across privacy-oriented unlearning benchmarks (ToFU and UnlearnPII), GU achieves strong target suppression with minimal impact on non-target performance, demonstrating that effective unlearning can be achieved with minimal synthetic data.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The manuscript proposes Geometric Unlearning (GU) for selective unlearning in LLMs. GU distills a compact low-rank geometry of safe behavior from a small set of safe reference prompts and performs projection-based alignment of hidden planning states using lightweight anchor-in-context synthetic prompts, plus a teacher-distillation regularizer on non-target anchors. The method requires no access to the original training corpus. On the ToFU and UnlearnPII privacy benchmarks, GU is reported to achieve strong target suppression while preserving non-target performance, using only minimal synthetic data.
Significance. If the central geometric alignment mechanism is shown to reliably remove target encodings from hidden states, the work would be significant for practical LLM governance. It reduces reliance on original data and broad updates, offering a data-efficient alternative to existing unlearning techniques. The emphasis on low-rank distillation and synthetic anchors could influence future privacy-preserving methods, provided the approach generalizes beyond output-level metrics.
Major comments (2)
- [§3] (Geometric Unlearning procedure): The core claim that projection onto the distilled low-rank safe geometry suppresses target information encoded during original training is load-bearing, yet the manuscript provides no hidden-state probing, membership-inference, or subspace analysis to confirm that target signals are removed rather than merely masked at the output level. Without such verification, residual encodings in orthogonal subspaces cannot be ruled out.
- [§4] (Benchmark evaluation): Results on ToFU and UnlearnPII report strong target suppression with minimal non-target degradation, but the evaluation relies on output accuracy and refusal metrics. No ablation isolating the contribution of the low-rank projection versus the synthetic anchors or regularizer is presented, making it difficult to attribute success specifically to the geometric component.
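One concrete form the requested subspace analysis could take is an overlap metric built from the principal angles between a target-fact subspace estimated before unlearning and the same estimate afterwards. A hypothetical sketch, not drawn from the manuscript:

```python
import numpy as np

def subspace_overlap(A: np.ndarray, B: np.ndarray) -> float:
    """Mean squared cosine of the principal angles between the column
    spaces of A and B (both assumed to have orthonormal columns).
    1.0 = identical subspaces, 0.0 = orthogonal. A sharp drop after
    unlearning suggests the target encoding was displaced, not masked."""
    s = np.linalg.svd(A.T @ B, compute_uv=False)  # cosines of principal angles
    return float((s ** 2).mean())

rng = np.random.default_rng(3)
Q, _ = np.linalg.qr(rng.normal(size=(64, 8)))  # 8 orthonormal directions
A, B = Q[:, :4], Q[:, 4:]  # two mutually orthogonal 4-dim subspaces
assert subspace_overlap(A, A) > 0.999
assert subspace_overlap(A, B) < 1e-9
```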
Minor comments (2)
- [§3.1] The notation for the projection operator and the rank parameter in the low-rank geometry distillation should be defined more explicitly, ideally with a small illustrative equation.
- [Figure 1] Figure captions for the method overview diagram could more clearly label the flow from safe prompts to anchor alignment and the role of the teacher regularizer.
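One way the requested definition in §3.1 could be made explicit, with $U$ the orthonormal basis of the rank-$r$ safe geometry and $h_t$ a hidden planning state (symbols are hypothetical, not the manuscript's notation):

```latex
% Hypothetical notation: U is the d x r orthonormal safe basis
% distilled from reference prompts; h_t is a hidden planning state.
P = U U^{\top}, \qquad U \in \mathbb{R}^{d \times r},\ U^{\top} U = I_r,
\qquad \tilde{h}_t = P\, h_t .
```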
Simulated Author's Rebuttal
We thank the referee for the thoughtful and constructive comments. We address each major point below and will incorporate revisions to strengthen the empirical support for the geometric mechanism.
Point-by-point responses
- Referee [§3] (Geometric Unlearning procedure): The core claim that projection onto the distilled low-rank safe geometry suppresses target information encoded during original training is load-bearing, yet the manuscript provides no hidden-state probing, membership-inference, or subspace analysis to confirm that target signals are removed rather than merely masked at the output level. Without such verification, residual encodings in orthogonal subspaces cannot be ruled out.
  Authors: We agree that direct verification of hidden-state suppression is important for substantiating the central mechanism. While the method explicitly aligns planning states via projection, and the output-level results on ToFU and UnlearnPII demonstrate effective target suppression with preserved utility, we acknowledge the absence of internal analysis. In the revision we will add hidden-state probing, membership-inference attacks on the target subspace, and before/after subspace-overlap metrics to show that target encodings are reduced rather than merely masked at the output. Revision: yes.
- Referee [§4] (Benchmark evaluation): Results on ToFU and UnlearnPII report strong target suppression with minimal non-target degradation, but the evaluation relies on output accuracy and refusal metrics. No ablation isolating the contribution of the low-rank projection versus the synthetic anchors or regularizer is presented, making it difficult to attribute success specifically to the geometric component.
  Authors: We thank the referee for highlighting the need for component-wise attribution. The current results show that the full pipeline works with minimal data, but we agree that isolating the low-rank projection is necessary. In the revised manuscript we will include ablations that remove or replace the projection step (while retaining anchors and regularizer) and report the resulting changes in target suppression and non-target performance on both benchmarks. Revision: yes.
Circularity Check
No significant circularity detected in derivation chain
Full rationale
The paper proposes Geometric Unlearning as a new method that distills a low-rank safe geometry from a small set of reference prompts and performs projection-based alignment on hidden states using synthetic anchors, with a teacher-distillation regularizer. All load-bearing steps (geometry distillation, projection alignment, and regularizer) are defined from first principles and external synthetic data rather than fitted to target outcomes or reduced to self-citations. Empirical results on ToFU and UnlearnPII are independent external benchmarks, not constructed by definition from the method inputs. No self-definitional, fitted-prediction, or uniqueness-imported circularity is present in the described chain.
Axiom & Free-Parameter Ledger
Axioms (1)
- Domain assumption: The internal hidden states of LLMs during prompt processing contain planning representations that can be aligned geometrically to achieve unlearning.
Invented entities (2)
- low-rank geometry of desired safe behavior (no independent evidence)
- anchor-in-context synthetic prompts (no independent evidence)