From Domains to Instances: Dual-Granularity Data Synthesis for LLM Unlearning

Haibo Hu; Minxin Du; Peizhao Hu; Qingqing Ye; Shiyu Zhang; Xiaoyu Xu; Zhibiao Guo; Zi Liang; Zitong Li

arxiv: 2601.04278 · v2 · submitted 2026-01-07 · cs.CL · cs.AI · cs.CR · cs.LG

Xiaoyu Xu, Minxin Du, Zitong Li, Zi Liang, Zhibiao Guo, Shiyu Zhang, Peizhao Hu, Qingqing Ye, Haibo Hu

Pith reviewed 2026-05-16 16:30 UTC · model grok-4.3

classification cs.CL · cs.AI · cs.CR · cs.LG

keywords LLM unlearning · data synthesis · machine unlearning · forget set generation · domain-level unlearning · instance-level unlearning · adversarial prompting

The pith

BiForget synthesizes forget sets for LLM unlearning by prompting the target model itself to match its internal knowledge.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper formalizes two granularities of unlearning: domain-level and instance-level. It proposes BiForget, which generates high-quality forget sets by using the target LLM with seed-guided and adversarial prompting instead of external generators. This produces data with better relevance and diversity while using half the size of prior methods. The result is more robust forgetting of unwanted content like private or copyrighted material and improved retention of model capabilities.

Core claim

BiForget prompts the target model itself, using seed-guided and adversarial prompting, to elicit forget data that matches its internal knowledge distribution. The paper reports a superior balance of relevance, diversity, and efficiency across benchmarks; in the Harry Potter domain, relevance improves by about 20 and diversity by about 0.05 while the data size is halved relative to prior state-of-the-art forget sets.

What carries the argument

The BiForget framework applies seed-guided and adversarial prompting to the target model to synthesize domain-level and instance-level forget sets that match its internal knowledge distribution.
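
To make the self-elicitation idea concrete, here is a minimal sketch of a seed-guided elicitation loop. It is an illustration, not the paper's implementation: the abstract gives no prompt templates, so the `templates`, the `refusal_markers` filter, and the `generate` callable (a stand-in for a call to the target model) are all hypothetical. Adversarial prompting is represented only as additional template wordings.

```python
def elicit_forget_set(generate, seeds, templates,
                      refusal_markers=("i cannot", "i can't")):
    """Collect deduplicated model outputs for every (seed, template) pair.

    generate: hypothetical callable mapping a prompt string to the target
              model's text output.
    seeds:    short strings naming the domain or instance to elicit.
    templates: prompt templates containing a "{seed}" placeholder
               (assumed wording, not taken from the paper).
    """
    seen = set()
    forget_set = []
    for seed in seeds:
        for template in templates:
            prompt = template.format(seed=seed)
            output = generate(prompt).strip()
            # Normalize case and whitespace so near-identical outputs collapse.
            key = " ".join(output.lower().split())
            # Drop empty outputs and crude refusals (assumed heuristic).
            if not key or any(marker in key for marker in refusal_markers):
                continue
            if key in seen:
                continue
            seen.add(key)
            forget_set.append({"seed": seed, "prompt": prompt, "text": output})
    return forget_set
```

Keeping the seed and prompt alongside each output makes it possible to audit later which prompts produced refusals or duplicates, which bears directly on the load-bearing premise discussed further down.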

If this is right

Facilitates more robust forgetting of specific domains or instances.
Preserves model utility better by using smaller, more targeted forget sets.
Enables more rigorous evaluation of unlearning methods due to faithful data synthesis.
Reduces reliance on external data sources for unlearning benchmarks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Applying self-elicited data could minimize distribution shifts in other model modification tasks.
Testing BiForget on additional domains like medical or code data would reveal its generalizability.
Combining it with existing unlearning algorithms might further enhance performance.

Load-bearing premise

Prompted data from the target model faithfully represents its internal knowledge distribution without introducing bias or hallucination.

What would settle it

Measure unlearning success on instances not used in synthesis. If BiForget forget sets do not outperform externally generated ones in forgetting accuracy and utility preservation, the claimed advantage would be refuted.

Original abstract

Although machine unlearning is essential for removing private, harmful, or copyrighted content from LLMs, current benchmarks often fail to faithfully represent the true "forgetting scope" learned by the model. We formalize two distinct unlearning granularities, domain-level and instance-level, and propose BiForget, an automated framework for synthesizing high-quality forget sets. Unlike prior work relying on external generators, BiForget exploits the target model per se to elicit data that matches its internal knowledge distribution through seed-guided and adversarial prompting. Our experiments across diverse benchmarks show that it achieves a superior balance of relevance, diversity, and efficiency. Quantitatively, in the Harry Potter domain, it improves relevance by ~20 and diversity by ~0.05 while halving the total data size compared to SOTAs. Ultimately, it facilitates more robust forgetting and better utility preservation, providing a more rigorous foundation for evaluating LLM unlearning.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

BiForget generates forget sets by prompting the target LLM itself at domain and instance levels, a practical shift that may improve alignment but rests on shaky assumptions about what the model actually reveals.

The paper formalizes unlearning at two granularities and introduces BiForget, which elicits forget data directly from the target model using seed-guided and adversarial prompts instead of external generators. This targets a real gap where standard benchmarks often use data that does not match what the model has internalized. The Harry Potter results claim higher relevance, modest diversity gains, and half the data volume compared with prior sets, which then supports stronger forgetting with less utility drop. Experiments across several benchmarks add some breadth. The approach is straightforward enough that labs working on safety could try it without heavy new infrastructure.

The soft spot is the core assumption that model-elicited data faithfully reflects internal knowledge. Prompting can produce refusals, hallucinations, or incomplete coverage on sensitive material, so the generated sets might leave gaps that later unlearning evaluations miss. The abstract gives headline numbers but no prompt templates, variance estimates, or controls for model-specific artifacts, which makes it hard to judge how much the gains depend on the particular setup. No circularity appears in the method itself.

This work is aimed at researchers building or auditing unlearning pipelines for deployed models. A reader focused on practical evaluation fixes would find the dual-granularity framing and the self-elicitation idea worth examining, even if the numbers need tighter validation. It deserves peer review to inspect the full experimental details and test whether the prompting step holds up under closer scrutiny.

Referee Report

2 major / 2 minor

Summary. The paper introduces BiForget, an automated framework for synthesizing forget sets for LLM unlearning at dual granularities (domain-level and instance-level). It uses the target model itself via seed-guided and adversarial prompting to generate data that matches its internal knowledge distribution, claiming superior relevance, diversity, and efficiency over prior methods relying on external generators. Experiments on benchmarks like the Harry Potter domain report improvements of approximately 20 in relevance and 0.05 in diversity while halving data size, leading to better forgetting robustness and utility preservation.

Significance. If the experimental claims hold after proper validation, BiForget could advance LLM unlearning research by addressing gaps in benchmark fidelity through model-elicited data synthesis. The dual-granularity formalization is a clear conceptual contribution, and the efficiency gains (halved data size) would be practically useful if reproducibility is established. The work is grounded in a plausible hypothesis but currently lacks the experimental rigor needed for strong impact.

major comments (2)

[Abstract and Experiments] Abstract and Experiments section: The reported gains (relevance improved by ~20, diversity by ~0.05, data size halved) are presented without defining the underlying metrics, listing exact baseline methods and their scores, reporting statistical tests or variance, or describing prompting templates and controls for model artifacts; this directly weakens the central claim of superior balance.
[Methodology] Methodology (BiForget framework description): The assumption that seed-guided and adversarial prompting of the target model produces data faithfully matching its internal knowledge distribution is load-bearing for all downstream claims but receives no validation, such as overlap analysis with known memorized content, human coverage checks, or ablation on hallucination rates.

minor comments (2)

[Throughout] Ensure consistent formatting of the framework name (BiForget) and any equations for data synthesis across sections.
[Related Work] Add explicit references to prior unlearning benchmarks and external-generator baselines in the related-work discussion.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their detailed feedback on our manuscript. We appreciate the opportunity to clarify and strengthen our presentation of the BiForget framework and its experimental validation. Below, we address each major comment point by point.

Referee: [Abstract and Experiments] Abstract and Experiments section: The reported gains (relevance improved by ~20, diversity by ~0.05, data size halved) are presented without defining the underlying metrics, listing exact baseline methods and their scores, reporting statistical tests or variance, or describing prompting templates and controls for model artifacts; this directly weakens the central claim of superior balance.

Authors: We agree that additional details are necessary to fully substantiate the reported gains. In the revised version, we will define the relevance metric (as the average semantic similarity score between generated forget instances and the model's internal knowledge representations, measured via embedding cosine similarity) and the diversity metric (as the normalized entropy of n-gram distributions across the dataset). We will include a comprehensive table in the experiments section listing all baseline methods with their exact scores for relevance, diversity, and data size. Statistical significance will be reported using paired t-tests with p-values and standard deviations across 5 random seeds. Prompting templates and controls (e.g., temperature settings, repetition penalties to mitigate artifacts) will be detailed in a new appendix section. These changes will make the superiority claims more transparent and verifiable. revision: yes
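
The two metric definitions proposed in this response can be sketched as follows. This is an assumption-laden illustration of "normalized entropy of n-gram distributions" and "embedding cosine similarity", not the paper's evaluation code: the whitespace tokenization, the default `n=2`, the normalization by the log of the number of distinct n-grams, and the use of a single `reference_vec` are all choices made here, and the embedding model is left to the caller.

```python
import math
from collections import Counter


def ngram_counts(texts, n=2):
    """Count word n-grams over all texts (whitespace tokenization is assumed)."""
    counts = Counter()
    for text in texts:
        tokens = text.lower().split()
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return counts


def normalized_ngram_entropy(texts, n=2):
    """Shannon entropy of the n-gram distribution, divided by log(#distinct n-grams).

    Returns a value in [0, 1]; 0.0 if fewer than two distinct n-grams exist.
    """
    counts = ngram_counts(texts, n)
    if len(counts) < 2:
        return 0.0
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log(c / total) for c in counts.values())
    return entropy / math.log(len(counts))


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def mean_relevance(item_vecs, reference_vec):
    """Average cosine similarity between each item embedding and a reference embedding."""
    return sum(cosine_similarity(v, reference_vec) for v in item_vecs) / len(item_vecs)
```

Under this definition a forget set whose n-grams are all equally frequent scores 1.0 on diversity, and heavy repetition pulls the score toward 0. Note that a cosine-based relevance score is bounded by 1, so the reported gain of about 20 implies some rescaling that the source does not specify.
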
Referee: [Methodology] Methodology (BiForget framework description): The assumption that seed-guided and adversarial prompting of the target model produces data faithfully matching its internal knowledge distribution is load-bearing for all downstream claims but receives no validation, such as overlap analysis with known memorized content, human coverage checks, or ablation on hallucination rates.

Authors: The design of BiForget is predicated on the target model's ability to generate data reflective of its learned knowledge through targeted prompting. While the current manuscript relies on downstream unlearning performance as indirect evidence, we recognize the value of direct validation. We will incorporate in the revised manuscript: (1) an overlap analysis measuring the intersection between generated forget sets and a set of known memorized facts from the target model (using both exact string matching and semantic similarity thresholds); (2) results from human annotators assessing coverage of domain knowledge; and (3) an ablation study quantifying hallucination rates by evaluating the factual correctness of generated instances against ground-truth sources. These additions will provide direct support for the assumption while preserving the original experimental outcomes. revision: yes
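
The overlap analysis promised in item (1) could look roughly like the sketch below. It is a hypothetical illustration: exact matching is done on case- and whitespace-normalized strings, and the "semantic similarity threshold" is replaced by a character-level ratio from Python's standard `difflib.SequenceMatcher`, which is a cheap stand-in rather than a true semantic measure. The function name, the `threshold` default of 0.8, and the returned keys are all assumptions.

```python
from difflib import SequenceMatcher


def normalize(text):
    """Lowercase and collapse whitespace so trivial differences do not block a match."""
    return " ".join(text.lower().split())


def overlap_report(generated, known_facts, threshold=0.8):
    """Fraction of known memorized facts recovered by the generated forget set.

    exact:   fact equals some generated item after normalization.
    fuzzy:   no exact match, but some generated item has a SequenceMatcher
             ratio >= threshold (stand-in for a semantic similarity check).
    covered: exact + fuzzy.
    """
    gen = [normalize(g) for g in generated]
    gen_set = set(gen)
    exact = 0
    fuzzy = 0
    for fact in known_facts:
        f = normalize(fact)
        if f in gen_set:
            exact += 1
        elif any(SequenceMatcher(None, f, g).ratio() >= threshold for g in gen):
            fuzzy += 1
    n = len(known_facts)
    return {"exact": exact / n, "fuzzy": fuzzy / n, "covered": (exact + fuzzy) / n}
```

Reporting exact and fuzzy coverage separately shows how much of the claimed overlap depends on the threshold, which is the kind of sensitivity the referee's comment asks about.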

Circularity Check

0 steps flagged

No circularity in BiForget synthesis or evaluation chain

The paper defines BiForget as an automated synthesis framework that elicits forget data from the target LLM itself via seed-guided and adversarial prompting to match its internal distribution, then empirically compares resulting relevance (~20 gain), diversity (~0.05 gain), and halved data size against external SOTA baselines on benchmarks such as Harry Potter. No equations, fitted parameters, or self-citations are shown that would make the claimed improvements reduce to the inputs by construction; the synthesis method and quantitative metrics remain independent of the downstream unlearning results.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 1 invented entity

The central claim rests on the domain assumption that model-elicited data accurately reflects internal knowledge; no explicit free parameters or invented entities beyond the proposed framework itself.

axioms (1)

domain assumption: Responses from the target model to seed-guided and adversarial prompts accurately reflect its internal knowledge distribution.
This underpins the entire self-elicitation strategy and is invoked to justify superiority over external generators.

invented entities (1)

BiForget framework (no independent evidence)
purpose: Automated dual-granularity forget-set synthesis via model self-elicitation.
Newly introduced method whose independent evidence is limited to the reported experiments.
