pith. machine review for the scientific record.

arxiv: 2604.07238 · v1 · submitted 2026-04-08 · 💻 cs.LG · cs.CL · cs.CR · cs.DS

Recognition: 2 theorem links

· Lean Theorem

On the Price of Privacy for Language Identification and Generation

Authors on Pith no claims yet

Pith reviewed 2026-05-10 18:49 UTC · model grok-4.3

classification 💻 cs.LG · cs.CL · cs.CR · cs.DS

keywords differential privacy · language identification · language generation · agnostic learning · error exponents · pure DP · approximate DP · privacy cost

The pith

Approximate differential privacy recovers the non-private error rates for language identification and generation, while pure DP multiplies the exponents by min{1, ε}.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

Large language models are often trained on sensitive user data, making it essential to understand the performance penalty from enforcing differential privacy. The paper studies both language identification and language generation in the agnostic statistical setting and constructs algorithms that achieve the same exponential error decay as non-private learning when using approximate (ε, δ)-DP with any fixed ε > 0. Identification error remains exp(-r(n)) for any r(n) = o(n), and generation error remains exp(-Ω(n)). Under pure ε-DP the exponents shrink by the factor min{1, ε}, and matching lower bounds show this degradation is tight up to constants, with exact optimality for generation under mild assumptions. The work therefore quantifies a surprisingly small price for privacy in these core language tasks.

Core claim

In the agnostic setting, differentially private algorithms exist for language identification and generation that achieve error rates of exp(-r(n)) for identification (any r(n) = o(n)) and exp(-Ω(n)) for generation under approximate (ε, δ)-DP with constant ε > 0, exactly matching the non-private rates. Under pure ε-DP the rates become exp(-min{1, ε} · r(n)) and exp(-min{1, ε} · Ω(n)), and these are shown to be tight via matching lower bounds; for generation the upper and lower bounds coincide up to constants under mild assumptions, establishing an optimal rate.
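
The claimed rates reduce to a one-line formula. The sketch below is an illustrative rendering of the stated exponents, not the paper's algorithm: the choice r(n) = √n is an arbitrary example of a sublinear rate, and the constant in front of the exponent is set to 1 purely for simplicity.

```python
import math

def identification_error_bound(n, eps, pure_dp, r=math.sqrt):
    """Illustrative bound exp(-m * r(n)): under approximate
    (eps, delta)-DP the multiplier m is 1, so the non-private
    rate is recovered; under pure eps-DP, m = min(1, eps)."""
    m = min(1.0, eps) if pure_dp else 1.0
    return math.exp(-m * r(n))

n, eps = 10_000, 0.5
approx = identification_error_bound(n, eps, pure_dp=False)  # exp(-100)
pure = identification_error_bound(n, eps, pure_dp=True)     # exp(-50)
# With eps = 0.5 < 1, pure DP halves the exponent, so its bound is larger.
print(approx, pure)
```

For ε ≥ 1 the multiplier saturates at 1 and the two regimes coincide in this sketch.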

What carries the argument

DP algorithms paired with information-theoretic lower bounds that together establish the precise multiplicative degradation of min{1, ε} in the error exponent under pure privacy.

If this is right

  • Language identification can be performed with privacy while retaining error decay exp(-r(n)) for any sublinear r(n).
  • Language generation retains exponential error decay exp(-Ω(n)) under approximate DP with constant ε.
  • Pure DP imposes only a constant-factor reduction in the exponent, allowing precise privacy-utility budgeting.
  • The matching lower bounds imply that no algorithm can improve the pure-DP rates by more than constants.
  • Constant-ε approximate DP is effectively free for these exponential-rate tasks.
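
The budgeting bullet can be made concrete: hitting a fixed target error under pure DP requires inflating n by a factor of 1/min{1, ε}. A hedged back-of-the-envelope, assuming generation error behaves like exp(-c·n) with an unspecified constant c (here set to 0.1 purely for illustration):

```python
import math

def samples_for_target(target_err, c, eps=None):
    """Samples n needed so that exp(-m * c * n) <= target_err,
    where m = min(1, eps) under pure eps-DP and m = 1 otherwise.
    Purely illustrative; c stands in for the paper's unspecified
    Omega(n) constant."""
    m = 1.0 if eps is None else min(1.0, eps)
    return math.ceil(-math.log(target_err) / (m * c))

c = 0.1
n_nonprivate = samples_for_target(1e-9, c)             # no privacy
n_pure = samples_for_target(1e-9, c, eps=0.25)         # pure DP, eps < 1
# Pure DP with eps = 0.25 inflates the sample requirement roughly 4x.
print(n_nonprivate, n_pure)
```

The 1/min{1, ε} inflation is exact in the exponent; the integer sample counts differ from an exact 4× only through rounding.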

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

  • The quantified mild cost may support wider use of DP when training LLMs on private text corpora.
  • Analogous rate calculations could apply to other sequence tasks such as machine translation.
  • Verifying the mild assumptions on real text distributions would make the optimal pure-DP rates directly usable in practice.
  • The same exponent-multiplier phenomenon may appear in privacy analyses of other autoregressive models.

Load-bearing premise

The analysis requires the agnostic statistical setting for both tasks and mild assumptions on the generation task under pure DP; if these fail the matching rates may not apply.

What would settle it

An explicit construction or distribution family on which the minimal error probability for pure-ε-DP language generation decays strictly more slowly than exp(-min{1, ε} · Ω(n)), by more than constant factors in the exponent, would falsify the claimed upper bound.
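
Settling it empirically would mean estimating decay exponents from observed error rates. A toy Monte Carlo sketch, with two hypothetical "languages" reduced to biased coins and identification done by likelihood ratio (none of this is the paper's setting; it only illustrates error decaying exponentially in n):

```python
import math
import random

def identify(samples, p1=0.5, p2=0.8):
    """Toy identifier: pick whichever Bernoulli bias better explains
    the observed count of ones (likelihood ratio). A stand-in for
    'language identification', not the paper's task."""
    k, n = sum(samples), len(samples)
    ll1 = k * math.log(p1) + (n - k) * math.log(1 - p1)
    ll2 = k * math.log(p2) + (n - k) * math.log(1 - p2)
    return 1 if ll1 >= ll2 else 2

def error_rate(n, trials=2000, true_p=0.8):
    """Fraction of trials where the true source (bias 0.8) is missed."""
    random.seed(0)  # reproducible toy experiment
    errs = 0
    for _ in range(trials):
        samples = [1 if random.random() < true_p else 0 for _ in range(n)]
        errs += identify(samples) != 2
    return errs / trials

# Error shrinks roughly exponentially in n in this toy model.
print(error_rate(10), error_rate(40))
```

Fitting a line to log error versus n would estimate the exponent; repeating the fit with a DP-ized identifier is the kind of experiment that could probe the min{1, ε} factor.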

read the original abstract

As large language models (LLMs) are increasingly trained on sensitive user data, understanding the fundamental cost of privacy in language learning becomes essential. We initiate the study of differentially private (DP) language identification and generation in the agnostic statistical setting, establishing algorithms and matching lower bounds that precisely quantify the cost of privacy. For both tasks, approximate $(\varepsilon, \delta)$-DP with constant $\varepsilon > 0$ recovers the non-private error rates: $\exp(-r(n))$ for identification (for any $r(n) = o(n)$) and $\exp(-\Omega(n))$ for generation. Under pure $\varepsilon$-DP, the exponents degrade by a multiplicative factor of $\min\{1, \varepsilon\}$, which we show is tight up to constants. Notably, for generation under pure DP with mild assumptions, the upper bound $\exp(-\min\{1,\varepsilon\} \cdot \Omega(n))$ matches the lower bound up to some constants, establishing an optimal rate. Our results show that the cost of privacy in language learning is surprisingly mild: absent entirely under approximate DP, and exactly a $\min\{1,\varepsilon\}$ factor in the exponent under pure DP.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated author's rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

1 major / 3 minor

Summary. The paper initiates the study of differentially private language identification and generation in the agnostic statistical setting. It provides algorithms and matching lower bounds showing that approximate (ε, δ)-DP with constant ε > 0 recovers the non-private error rates of exp(-r(n)) for identification (any r(n) = o(n)) and exp(-Ω(n)) for generation. Under pure ε-DP the exponents degrade by a multiplicative factor of min{1, ε}, shown to be tight up to constants; for generation under pure DP with mild assumptions the upper and lower bounds match up to constants, establishing optimality.

Significance. If the matching bounds hold, the work is significant for providing the first precise quantification of privacy costs in language tasks. The result that approximate DP incurs no asymptotic penalty (recovering exp(-r(n)) and exp(-Ω(n)) rates) and that pure DP incurs exactly a min{1, ε} factor in the exponent (tight for generation) has direct implications for private LLM training. Credit is due for establishing matching upper and lower bounds and for the parameter-free character of the leading exponential terms under approximate DP.

major comments (1)
  1. [Generation under pure DP (likely §4 or §5)] The mild assumptions required for the generation task under pure ε-DP (to obtain matching upper and lower bounds) are load-bearing for the optimality claim; they must be stated explicitly, shown to be necessary, and verified to hold in the agnostic setting without restricting the central result.
minor comments (3)
  1. [Introduction / Preliminaries] Clarify the precise definition of the agnostic statistical setting and how it interacts with the language identification and generation error measures early in the paper.
  2. [Abstract and main theorems] The statement that the degradation factor is 'tight up to constants' should include a brief remark on whether the constants are independent of n or depend on other parameters.
  3. [Preliminaries] Ensure all notation for the rates (e.g., r(n), Ω(n)) is defined consistently before the first use of the main theorems.

Simulated Author's Rebuttal

1 response · 0 unresolved

We thank the referee for the positive evaluation and the recommendation of minor revision. The single major comment is addressed below; we will incorporate the requested clarifications into the revised manuscript.

read point-by-point responses
  1. Referee: [Generation under pure DP (likely §4 or §5)] The mild assumptions required for the generation task under pure ε-DP (to obtain matching upper and lower bounds) are load-bearing for the optimality claim; they must be stated explicitly, shown to be necessary, and verified to hold in the agnostic setting without restricting the central result.

    Authors: We agree that the assumptions deserve more prominent and self-contained treatment. In the current draft they appear inline within the statement of the relevant theorem on pure-DP generation. In the revision we will: (i) extract them into an explicit, bulleted list placed immediately before the theorem; (ii) add a short paragraph explaining why each assumption is necessary for the upper and lower bounds to match up to constants (they ensure that the pure-DP constraint does not introduce extra logarithmic or polynomial factors in the exponent while preserving the linear dependence on n); and (iii) verify that the assumptions hold in the agnostic setting—they impose only the standard finite-alphabet and finite-support conditions already required for the language-identification and generation problems to be well-posed, without restricting the class of distributions or the central non-private rates. These changes are purely expository and do not alter any theorems or proofs. revision: yes

Circularity Check

0 steps flagged

No significant circularity; bounds derived independently

full rationale

The paper establishes algorithms achieving non-private exponential rates under approximate (ε,δ)-DP and proves matching lower bounds that quantify the exact multiplicative degradation min{1,ε} under pure ε-DP. Upper bounds rely on explicit mechanisms whose privacy overhead does not alter leading terms for constant ε; lower bounds are shown separately via information-theoretic arguments that do not reduce to the upper-bound constructions or to self-citations. No self-definitional steps, fitted parameters renamed as predictions, or load-bearing self-citations appear in the derivation chain. The result is self-contained against external statistical benchmarks.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

The central claims rest on the standard definitions of approximate and pure differential privacy plus the agnostic statistical model for language tasks; no new entities are introduced.

free parameters (2)
  • ε
    Privacy budget parameter that scales the exponent under pure DP.
  • δ
    Failure probability parameter in approximate DP.
axioms (2)
  • domain assumption Agnostic statistical setting where the true distribution is unknown
    Invoked throughout to define the learning tasks and error rates.
  • ad hoc to paper Mild assumptions for generation under pure DP
    Required for the tight matching of upper and lower bounds in generation.

pith-pipeline@v0.9.0 · 5515 in / 1382 out tokens · 47066 ms · 2026-05-10T18:49:46.112542+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
matches
The paper's claim is directly supported by a theorem in the formal canon.
supports
The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends
The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses
The paper appears to rely on the theorem as machinery.
contradicts
The paper's claim conflicts with a theorem or certificate in the canon.
unclear
Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

  1. Contrastive Identification and Generation in the Limit

    cs.LG 2026-05 unverdicted novelty 8.0

    Contrastive pair presentations yield exact identifiability characterizations via a geometric refinement of Angluin's condition, a new contrastive closure dimension for generation, mutual incomparability with text iden...

Reference graph

Works this paper leans on

2 extracted references · 1 canonical work pages · cited by 1 Pith paper

  1. [1]

    Borja Balle and Yu-Xiang Wang

    Borja Balle and Yu-Xiang Wang. Improving the Gaussian mechanism for differential privacy: Analytical calibration and optimal denoising. In Proceedings of the 35th International Conference on Machine Learning (ICML), pages 394–403. PMLR, 2018.

  2. [2]

    Arenas et al. [2025]

    Computational and sample-complexity barriers were analyzed by Arenas et al. [2025]. More recent extensions include generation in continuous metric spaces [Li et al., 2026] and safe generation [Anastasopoulos et al., 2026]. Racca et al. [2026] study model collapse from a learning-theoretic perspective, showin...