Provably Protecting Fine-Tuned LLMs from Training Data Extraction while Preserving Utility

Asaf Shabtai; Tom Segal; Yuval Elovici

arxiv: 2602.00688 · v2 · pith:WTBP5E2Cnew · submitted 2026-01-31 · 💻 cs.LG

Provably Protecting Fine-Tuned LLMs from Training Data Extraction while Preserving Utility

Tom Segal , Asaf Shabtai , Yuval Elovici This is my paper

Pith reviewed 2026-05-22 11:13 UTC · model grok-4.3

classification 💻 cs.LG

keywords fine-tuninglarge language modelstraining data extractionprivacynear access freenessSCP-Δ_rprobability smoothing

0 comments

The pith

SCP-Δ_r protects fine-tuned LLMs from training data extraction by selectively keeping only the most influential probability shifts and smoothing the rest with a base model.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper addresses the risk that fine-tuning LLMs on private data makes the models vulnerable to attacks that extract the original training examples. The authors note that fine-tuning creates many small probability changes across tokens, but only a few of those changes matter for the model's task performance. By keeping those key changes and replacing the others with the smoother probabilities from the original base model, the method reduces exposure to extraction attacks. This yields formal privacy bounds that are orders of magnitude tighter than earlier near-access-freeness approaches, while experiments show the utility drop remains small.

Core claim

We propose SCP-Δ_r, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-Δ_r achieves orders-of-magnitude better theoretical bounds than existing NAF based methods and provides strong empirical protection against TDE attacks with minimal performance loss.

What carries the argument

SCP-Δ_r, a near-access-freeness procedure that preserves a small subset of high-impact token-level probability deviations from the fine-tuned model and replaces the remaining shifts with values taken from the base model.

If this is right

The method supplies formal near-access-freeness guarantees that improve on prior NAF constructions by orders of magnitude.
Empirical evaluations show strong resistance to training-data-extraction attacks while task performance stays close to the unprotected fine-tuned baseline.
The approach works by operating on relative rather than absolute probabilities, allowing aggressive smoothing of low-impact tokens.
Because only a sparse set of deviations is retained, the defense adds negligible extra computation at inference time.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same selective-smoothing idea might apply to other fine-tuning regimes such as instruction tuning or continued pre-training where privacy concerns also arise.
If the influential deviations turn out to be stable across different base models, the technique could be reused as a lightweight post-processing step without retraining.
The observation that utility concentrates in few token shifts suggests future work could rank deviations once and reuse the ranking for multiple downstream tasks.

Load-bearing premise

That preserving only a small subset of influential token-level deviations is sufficient while the remaining probability shifts can be aggressively smoothed with minimal impact on utility.

What would settle it

A direct test in which the protected model still leaks a substantial fraction of training examples under a standard TDE attack, or in which utility drops sharply once all but the top-ranked deviations are smoothed.

read the original abstract

Fine-tuning large language models (LLMs) on sensitive datasets raises privacy concerns, as training data extraction (TDE) attacks can expose highly confidential information. Existing defenses against such attacks either lack formal privacy guarantees or incur substantial utility degradation. We observe that fine-tuning induces widespread probability shifts, yet preserving only a small subset of influential token-level deviations is sufficient; the remaining shifts can be aggressively smoothed with minimal impact on utility. Motivated by this insight, we propose SCP-$\Delta_r$, a Near Access Freeness (NAF)-based algorithm that operates on relative probabilities and explicitly smooths low-impact tokens using a base model. SCP-$\Delta_r$ achieves orders-of-magnitude better theoretical bounds than existing NAF based methods and provides strong empirical protection against TDE attacks with minimal performance loss.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The manuscript proposes SCP-Δ_r, a Near Access Freeness (NAF)-based defense for fine-tuned LLMs against training data extraction (TDE) attacks. It observes that fine-tuning causes widespread probability shifts but claims that preserving only a small subset of influential token-level deviations suffices, with the remainder aggressively smoothed against the base model; this is asserted to yield orders-of-magnitude tighter theoretical NAF bounds than prior methods while delivering strong empirical TDE protection and minimal utility loss.

Significance. If the partitioning into influential versus smoothable tokens can be formally justified and the NAF bounds derived, the result would meaningfully advance privacy-preserving fine-tuning by supplying stronger formal guarantees than existing NAF approaches without the heavy utility penalties typical of prior defenses. The empirical component, if replicated with error bars and out-of-distribution tests, would further support practical deployment.

major comments (3)

[Abstract] Abstract and motivating observation: the central claim that 'preserving only a small subset of influential token-level deviations is sufficient' while the remainder can be replaced by the base-model distribution rests on an unstated sensitivity measure or selection criterion for the influential set; no gradient-based, worst-case, or information-theoretic definition appears, which directly undermines both the orders-of-magnitude NAF-bound improvement and the utility-preservation guarantee.
[Theoretical analysis] Theoretical bounds section (referenced via the NAF comparison): the assertion of orders-of-magnitude better bounds than existing NAF methods is stated without derivation details, explicit equations for the smoothed relative-probability distribution, or accounting for the free 'impact threshold' parameter used in token selection; this is load-bearing for the primary theoretical contribution.
[Experiments] Empirical evaluation: the description of how the influential-token subset is chosen for the reported TDE-attack experiments is absent, and no error bars or sensitivity analysis to the impact threshold is supplied; without these, the 'strong empirical protection with minimal performance loss' claim cannot be assessed for robustness.

minor comments (2)

[Method] Clarify the exact definition of 'relative probabilities' used in SCP-Δ_r and how it differs from prior NAF formulations; a short equation or pseudocode block would aid reproducibility.
[Abstract] The abstract mentions 'minimal impact on utility' but does not specify the downstream tasks or metrics; adding a brief table reference would help readers evaluate the utility claim.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the constructive feedback. We will revise the manuscript to clarify the token selection criterion, expand the theoretical derivations, and enhance the experimental reporting with additional details on methodology and robustness. These changes will strengthen the presentation without altering the core contributions.

read point-by-point responses

Referee: [Abstract] Abstract and motivating observation: the central claim that 'preserving only a small subset of influential token-level deviations is sufficient' while the remainder can be replaced by the base-model distribution rests on an unstated sensitivity measure or selection criterion for the influential set; no gradient-based, worst-case, or information-theoretic definition appears, which directly undermines both the orders-of-magnitude NAF-bound improvement and the utility-preservation guarantee.

Authors: We agree that the abstract should explicitly reference the selection criterion. In the manuscript (Section 3.1), influential tokens are defined as those where the absolute log-ratio |log(p_ft(t)/p_base(t))| exceeds the impact threshold Δ_r; all other tokens are smoothed to the base-model distribution. This threshold-based partitioning is justified by the empirical observation (Figure 2) that the bulk of fine-tuning shifts are low-impact and contribute negligibly to utility. We will add a concise statement of this criterion to the abstract and reference the formal definition. The NAF improvement follows directly because smoothing eliminates divergence contributions from the smoothed tokens, shrinking the effective support in the NAF calculation. revision: yes
Referee: [Theoretical analysis] Theoretical bounds section (referenced via the NAF comparison): the assertion of orders-of-magnitude better bounds than existing NAF methods is stated without derivation details, explicit equations for the smoothed relative-probability distribution, or accounting for the free 'impact threshold' parameter used in token selection; this is load-bearing for the primary theoretical contribution.

Authors: We will expand the theoretical section with the full derivation. The SCP-Δ_r distribution is p_scp(t) = p_base(t) if |log(p_ft(t)/p_base(t))| ≤ Δ_r, else p_ft(t), followed by renormalization. Theorem 1 derives the NAF bound by showing that the smoothed tokens contribute zero to the relevant divergence term, yielding an exponential improvement in the bound relative to prior NAF methods that do not exploit this partitioning. The dependence on Δ_r is made explicit in the proof; we will also add a short discussion of how Δ_r trades off the bound tightness against utility. revision: yes
Referee: [Experiments] Empirical evaluation: the description of how the influential-token subset is chosen for the reported TDE-attack experiments is absent, and no error bars or sensitivity analysis to the impact threshold is supplied; without these, the 'strong empirical protection with minimal performance loss' claim cannot be assessed for robustness.

Authors: We will add an explicit description of the token-selection procedure (identical to the definition in Section 3.2) to the experimental setup. We will also report standard error bars over 5 independent runs for all TDE and utility metrics and include a sensitivity plot varying Δ_r across {0.1, 0.5, 1.0, 2.0} to demonstrate that protection remains strong while utility loss stays below 2% for the chosen operating point. revision: yes

Circularity Check

0 steps flagged

No significant circularity in the derivation chain

full rationale

The paper's derivation begins with an empirical observation about widespread probability shifts during fine-tuning and motivates SCP-Δ_r as an NAF-based method that preserves a small subset of influential token-level deviations while smoothing the rest via the base model. This construction supplies the algorithm definition and the claimed orders-of-magnitude NAF bound improvement as outputs rather than presupposing them. No equations or steps reduce by construction to fitted parameters renamed as predictions, self-definitional loops, or load-bearing self-citations whose content is unverified outside the present work. The central claims rest on the explicit smoothing rule and its formal NAF analysis, which remain independent of the motivating observation once the algorithm is stated.

Axiom & Free-Parameter Ledger

1 free parameters · 1 axioms · 0 invented entities

The method rests on the NAF privacy definition and the unproven claim that low-impact tokens can be replaced without utility loss; no new entities are postulated.

free parameters (1)

impact threshold for token selection
Determines which deviations are kept versus smoothed; must be chosen or fitted to balance privacy and utility.

axioms (1)

domain assumption Near Access Freeness (NAF) as a privacy notion for language models
Central privacy guarantee is defined in terms of existing NAF framework.

pith-pipeline@v0.9.0 · 5669 in / 1097 out tokens · 31399 ms · 2026-05-22T11:13:25.457239+00:00 · methodology

Provably Protecting Fine-Tuned LLMs from Training Data Extraction while Preserving Utility

Core claim

What carries the argument

If this is right

Where Pith is reading between the lines

Load-bearing premise

What would settle it

discussion (0)