Less is More: Geometric Unlearning for LLMs with Minimal Data Disclosure
Pith reviewed 2026-07-01 00:27 UTC · model grok-4.3
The pith
Geometric Unlearning suppresses specific LLM knowledge by projecting hidden states onto a low-rank safe subspace distilled from minimal safe prompts.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Geometric Unlearning distills a compact low-rank safe-behavior subspace from a small set of safe reference prompts and performs localized projection-based alignment of prompt-conditioned hidden states onto this subspace using lightweight synthetic anchors, with a teacher-distillation regularizer on non-target anchors to limit collateral drift, achieving target suppression without access to the original training corpus.
What carries the argument
The low-rank safe-behavior subspace distilled from safe reference prompts carries the argument: it serves as the target for localized projection-based alignment of hidden representations.
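To make the geometry concrete, here is a minimal sketch of fitting a low-rank subspace and projecting hidden states onto it. It assumes the subspace is obtained by a truncated SVD of mean-centered hidden states, which is one common choice; the paper's actual distillation procedure is not described in this summary. The names `safe_subspace`, `project`, and `alignment_loss` are hypothetical.

```python
from typing import Tuple

import numpy as np


def safe_subspace(safe_states: np.ndarray, rank: int) -> Tuple[np.ndarray, np.ndarray]:
    """Fit a rank-`rank` affine subspace to hidden states of safe prompts.

    Assumption: truncated SVD (PCA) stands in for the paper's distillation step.
    Returns (mean, basis); basis has orthonormal columns and shape (d, rank).
    """
    mean = safe_states.mean(axis=0)
    # Rows of vt are the principal directions of the centered states.
    _, _, vt = np.linalg.svd(safe_states - mean, full_matrices=False)
    return mean, vt[:rank].T


def project(h: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> np.ndarray:
    """Orthogonally project hidden states h (n, d) onto the affine subspace."""
    return mean + (h - mean) @ basis @ basis.T


def alignment_loss(h: np.ndarray, mean: np.ndarray, basis: np.ndarray) -> float:
    """Mean squared distance between hidden states and their projections."""
    residual = h - project(h, mean, basis)
    return float(np.mean(np.sum(residual ** 2, axis=-1)))
```

In a training loop, a loss of this shape would be minimized only on hidden states produced by target-anchor prompts, which is what makes the alignment localized.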
If this is right
- Target suppression remains strong on the ToFU and UnlearnPII benchmarks while non-target performance stays close to the original model.
- Unlearning succeeds using only synthetic data and no access to the original training corpus.
- The method reduces collateral drift through the added regularizer on synthetic non-target anchors.
- Localized projection on hidden states replaces the need for broad gradient updates or refusal tuning.
Where Pith is reading between the lines
- The same projection approach might extend to forgetting specific capabilities rather than just factual entities if suitable reference subspaces can be identified.
- Deployment pipelines could integrate this form of unlearning as a lightweight post-training step triggered by new privacy requests.
- The reliance on synthetic anchors suggests that generating high-quality non-target examples becomes a key practical variable for scaling the technique.
Load-bearing premise
Aligning hidden states to a subspace derived from a small number of safe prompts selectively removes target information without causing broader unintended changes in the model's behavior.
What would settle it
An experiment showing that after unlearning, the model still produces the targeted private information in response to direct or indirect queries about the suppressed entity, or exhibits measurable degradation on standard non-target tasks.
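A rough sketch of how the leakage half of that experiment could be scored, assuming model responses have already been collected as strings. The function name `leak_rate` is hypothetical, and case-insensitive substring matching is a deliberately crude stand-in: it misses paraphrased or partially disclosed information, so it gives a lower bound on leakage.

```python
from typing import Iterable, Sequence


def leak_rate(responses: Sequence[str], secrets: Iterable[str]) -> float:
    """Fraction of responses that contain at least one secret string.

    Assumption: exact (case-insensitive) substring match counts as a leak.
    """
    needles = [s.casefold() for s in secrets if s]
    if not responses or not needles:
        return 0.0
    leaked = sum(
        any(needle in response.casefold() for needle in needles)
        for response in responses
    )
    return leaked / len(responses)
```

The same score would be computed separately for direct queries and for rephrased or indirect queries, since the premise is most likely to fail on the latter.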
Original abstract
As large language models (LLMs) are increasingly deployed in real-world systems, they must support post-hoc removal of specific content to meet privacy and governance requirements. This motivates selective unlearning, which suppresses information about a particular entity or topic while preserving the LLM's general utility. However, most existing LLM unlearning methods require access to the original training corpus and rely on output-level refusal tuning or broad gradient updates, creating a tension among unlearning strength, non-target preservation, and data availability. We propose Geometric Unlearning (GU), an approach that operates directly on the model's prompt-conditioned hidden states without access to the original training corpus. Specifically, GU distills a compact, low-rank safe-behavior subspace from a small set of safe reference prompts and uses lightweight anchor-in-context synthetic prompts to trigger localized, projection-based alignment of hidden representations to this safe subspace. A teacher-distillation regularizer on synthetic non-target anchors further reduces collateral drift. Across privacy-oriented unlearning benchmarks (ToFU and UnlearnPII), GU achieves strong target suppression with minimal impact on non-target performance, demonstrating that effective unlearning can be achieved with minimal synthetic data.
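The abstract describes two terms: an alignment term on target anchors and a teacher-distillation regularizer on non-target anchors. The sketch below shows one plausible way to combine them, assuming the regularizer is a KL divergence between the frozen teacher's and the student's output distributions and that the two terms are summed with a weight `lam`. Neither the KL form nor the weighting is confirmed by the abstract; `kl_teacher_student`, `combined_objective`, and `lam` are hypothetical names.

```python
import numpy as np


def log_softmax(logits: np.ndarray) -> np.ndarray:
    """Numerically stable log-softmax over the last axis."""
    shifted = logits - logits.max(axis=-1, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))


def kl_teacher_student(teacher_logits: np.ndarray, student_logits: np.ndarray) -> float:
    """Mean KL(teacher || student) over a batch of non-target anchors."""
    log_t = log_softmax(teacher_logits)
    log_s = log_softmax(student_logits)
    return float(np.mean(np.sum(np.exp(log_t) * (log_t - log_s), axis=-1)))


def combined_objective(
    target_residual_sq: np.ndarray,
    teacher_logits: np.ndarray,
    student_logits: np.ndarray,
    lam: float = 1.0,
) -> float:
    """Alignment term on target anchors plus weighted drift penalty.

    target_residual_sq: squared distance of each target-anchor hidden state
    from the safe subspace (assumed to be computed elsewhere).
    """
    alignment = float(np.mean(target_residual_sq))
    drift = kl_teacher_student(teacher_logits, student_logits)
    return alignment + lam * drift
```

A larger `lam` would trade unlearning strength for non-target preservation, which is the tension the abstract names.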
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Geometric Unlearning (GU) for selective unlearning in LLMs without access to the original training corpus. GU distills a compact low-rank safe-behavior subspace from a small set of safe reference prompts, then applies localized projection-based alignment of hidden states triggered by lightweight anchor-in-context synthetic prompts, with a teacher-distillation regularizer on synthetic non-target anchors to limit drift. The evaluations on the ToFU and UnlearnPII benchmarks report strong target suppression with minimal non-target impact, which the authors take to show that effective unlearning is possible with minimal synthetic data.
Significance. If the central claims hold under scrutiny, the work would be significant for privacy-oriented LLM governance: it offers a data-minimal alternative to corpus-dependent or broad-gradient unlearning methods while preserving utility, potentially easing the tension between unlearning strength and data availability.
Major comments (2)
- [Abstract] The abstract provides no equations, implementation details, or quantitative results beyond high-level claims, so it is impossible to assess whether the described projection and distillation steps actually support the suppression claim.
- [Method] The construction assumes target-specific information is linearly separable from safe behavior in hidden-state space and that the low-rank subspace distilled from limited safe prompts will isolate and erase it; no analysis or test is provided to establish this separability for entangled factual knowledge, so the subspace could capture only generic refusal patterns while leaving target facts intact under rephrasing or indirect prompting.
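One way the requested separability test could be sketched is a linear probe: fit a linear classifier to distinguish hidden states of target prompts from those of non-target prompts and report held-out accuracy, ideally on rephrased or indirect prompts. The sketch below assumes hidden states have been extracted into arrays and uses a least-squares probe for simplicity; `fit_linear_probe` and `probe_accuracy` are hypothetical names, and this is not an analysis the paper is known to contain.

```python
import numpy as np


def fit_linear_probe(x: np.ndarray, y: np.ndarray) -> np.ndarray:
    """Least-squares linear probe.

    x: hidden states, shape (n, d). y: labels in {0, 1}, shape (n,).
    Returns weights of shape (d + 1,), the last entry being the bias.
    """
    xb = np.hstack([x, np.ones((len(x), 1))])
    # Regress onto targets in {-1, +1}; the sign of the fit is the prediction.
    w, *_ = np.linalg.lstsq(xb, 2.0 * y - 1.0, rcond=None)
    return w


def probe_accuracy(w: np.ndarray, x: np.ndarray, y: np.ndarray) -> float:
    """Fraction of held-out states the probe classifies correctly."""
    xb = np.hstack([x, np.ones((len(x), 1))])
    pred = (xb @ w > 0).astype(int)
    return float(np.mean(pred == y))
```

High held-out accuracy before unlearning and near-chance accuracy afterwards would support the assumption; high accuracy afterwards would suggest the target information is still linearly readable.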
Minor comments (1)
- The description of how synthetic anchors are generated and how the teacher-distillation regularizer is weighted should be expanded for reproducibility.
Simulated Author's Rebuttal
Thank you for the detailed review. We respond to each major comment below, indicating planned revisions to the manuscript.
- Referee: [Abstract] The abstract provides no equations, implementation details, or quantitative results beyond high-level claims, so it is impossible to assess whether the described projection and distillation steps actually support the suppression claim.
  Authors: We agree with this observation. The revised abstract will incorporate a concise description of the projection-based alignment and key quantitative results from the ToFU and UnlearnPII benchmarks to substantiate the claims. Revision: yes.
- Referee: [Method] The construction assumes target-specific information is linearly separable from safe behavior in hidden-state space and that the low-rank subspace distilled from limited safe prompts will isolate and erase it; no analysis or test is provided to establish this separability for entangled factual knowledge, so the subspace could capture only generic refusal patterns while leaving target facts intact under rephrasing or indirect prompting.
  Authors: While the method is supported by strong empirical performance on the benchmarks, we acknowledge that an explicit analysis of the linear separability assumption is absent from the current manuscript. We will add such an analysis, including tests for robustness against rephrasing and indirect prompts, in the revised version. Revision: yes.
Circularity Check
No circularity: method is a self-contained algorithmic proposal
The paper introduces Geometric Unlearning as a new procedure that distills a low-rank subspace from safe reference prompts and performs projection alignment on synthetic anchors, with a teacher regularizer. No equations, parameter-fitting steps, or self-citations are shown in the abstract or described claims that reduce any prediction or uniqueness result to the inputs by construction. The central claims rest on the empirical performance of the proposed algorithm rather than any definitional or self-referential reduction.
Forward citations
Cited by 1 Pith paper
- Null-Space Constrained Low-Rank Adaptation for Response-Specified Large Language Model Unlearning
  NSRU constrains LoRA updates via null-space projection of retain subspaces to jointly optimize safe-target learning, undesired-response suppression, and retention in LLM unlearning.
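For readers unfamiliar with null-space projection, the following sketch shows the generic idea only: a weight update is projected so that it has no effect on inputs lying in a retain subspace. It assumes the convention `output = input @ weight` and estimates the retain subspace by truncated SVD; it is not the NSRU algorithm, whose details are not given here, and `null_space_projector` and `constrain_update` are hypothetical names.

```python
import numpy as np


def null_space_projector(retain_inputs: np.ndarray, rank: int) -> np.ndarray:
    """Projector onto the orthogonal complement of the top-`rank` retain directions.

    retain_inputs: layer inputs for retain data, shape (n, d_in).
    Returns a (d_in, d_in) symmetric projection matrix.
    """
    _, _, vt = np.linalg.svd(retain_inputs, full_matrices=False)
    v = vt[:rank].T  # orthonormal basis of the retain subspace, (d_in, rank)
    return np.eye(v.shape[0]) - v @ v.T


def constrain_update(delta_w: np.ndarray, projector: np.ndarray) -> np.ndarray:
    """Remove the part of a weight update (d_in, d_out) that acts on retain inputs."""
    return projector @ delta_w
```

When the retain inputs lie entirely within the estimated subspace, `retain_inputs @ constrain_update(delta_w, projector)` is zero up to floating-point error, so the layer's outputs on retain data are unchanged by the update.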