The Surprising Effectiveness of Membership Inference with Simple N-Gram Coverage

Abhilasha Ravichander; Jaehun Jung; Melanie Sclar; Niloofar Mireshghallah; Sahana Ramnath; Sai Praneeth Karimireddy; Skyler Hallinan; Xiang Ren; Ximing Lu; Yejin Choi

arxiv: 2508.09603 · v2 · submitted 2025-08-13 · 💻 cs.CL

Pith reviewed 2026-05-18 22:54 UTC · model grok-4.3

classification 💻 cs.CL

keywords membership inference · language models · black-box attack · n-gram overlap · data leakage · privacy · model auditing

The pith

N-gram overlap from multiple generations detects training data membership using only model text outputs.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a membership inference method that checks whether a candidate text appeared in a language model's training data. It generates several continuations from a short prefix of the text and measures how much their n-grams match the actual suffix. High match rates signal likely membership because models reproduce training patterns more readily. The approach works on fully black-box models that return only text, yet it matches or exceeds the accuracy of attacks that require internal probabilities or hidden states. This enables direct checks on widely deployed API models for unintended data exposure.

Core claim

The central claim is that language models memorize and reproduce n-gram sequences from their training data, so the degree of n-gram coverage between multiple prefix-conditioned generations and the ground-truth suffix serves as a reliable indicator of whether the candidate text was a training member.

What carries the argument

N-Gram Coverage Attack, which samples multiple model outputs conditioned on a prefix and aggregates n-gram overlap scores with the true suffix to produce a membership score.
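The scoring step described above can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: the whitespace tokenizer, the trigram default, the specific coverage metric, and the `max` aggregation are all assumptions, and the function names `ngrams`, `coverage`, and `membership_score` are hypothetical. The generations are assumed to have already been sampled from the target model using the candidate's prefix.

```python
from collections import Counter
from typing import Callable, List, Sequence


def ngrams(tokens: Sequence[str], n: int) -> Counter:
    """Multiset of contiguous n-grams in a token sequence."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def coverage(generation: str, suffix: str, n: int = 3) -> float:
    """Fraction of the suffix's n-grams that also occur in the generation."""
    ref = ngrams(suffix.split(), n)   # assumption: whitespace tokenization
    hyp = ngrams(generation.split(), n)
    total = sum(ref.values())
    if total == 0:
        return 0.0
    matched = sum(min(count, hyp[gram]) for gram, count in ref.items())
    return matched / total


def membership_score(
    generations: List[str],
    suffix: str,
    n: int = 3,
    aggregate: Callable[[List[float]], float] = max,  # assumption: max over generations
) -> float:
    """Aggregate per-generation coverage into one membership score."""
    if not generations:
        raise ValueError("need at least one generation")
    return aggregate([coverage(g, suffix, n) for g in generations])


suffix = "the cat sat on the mat"
generations = ["the cat sat on a rug", "a dog ran away"]
print(membership_score(generations, suffix))  # 0.5
```

In this toy case the first generation shares two of the suffix's four trigrams and the second shares none, so the maximum coverage is 0.5. Swapping `aggregate` for a mean would reward consistent reproduction across samples rather than a single close match.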

If this is right

Performance rises steadily as the number of generated sequences from the target model increases.
More recent models, such as GPT-4o, show lower attack success rates, which suggests greater robustness to membership inference.
Closed commercial models can now be audited for data leakage without internal access.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same n-gram coverage signal could be tested on non-language sequence models to check for memorization.
Training procedures that penalize n-gram repetition might reduce vulnerability to this form of inference.
Auditors could combine this method with known non-member texts to calibrate decision thresholds for specific domains.
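The calibration idea in the last point can be sketched as follows. This is an illustrative assumption about how an auditor might proceed, not a procedure from the paper: the threshold is chosen so that at most a target fraction of known non-member scores lie strictly above it. The names `calibrate_threshold` and `predict_member` are hypothetical.

```python
import math
from typing import List


def calibrate_threshold(non_member_scores: List[float], target_fpr: float = 0.05) -> float:
    """Pick a threshold so that at most target_fpr of non-members score strictly above it."""
    if not non_member_scores:
        raise ValueError("need at least one non-member score")
    if not 0.0 <= target_fpr < 1.0:
        raise ValueError("target_fpr must be in [0, 1)")
    ordered = sorted(non_member_scores)
    # Number of non-members we tolerate above the threshold.
    allowed = math.floor(target_fpr * len(ordered))
    return ordered[len(ordered) - 1 - allowed]


def predict_member(score: float, threshold: float) -> bool:
    """Flag a candidate as a likely member when its score exceeds the threshold."""
    return score > threshold


non_member_scores = [i / 10 for i in range(10)]  # toy scores 0.0, 0.1, ..., 0.9
threshold = calibrate_threshold(non_member_scores, target_fpr=0.2)
print(threshold)  # 0.7
```

Calibrating per domain matters because baseline overlap can differ by domain, so a threshold tuned on one kind of text may not transfer to another.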

Load-bearing premise

That elevated n-gram overlap specifically reflects training-set membership rather than the model's general fluency or effects from the chosen prefix.

What would settle it

Apply the attack to a set of texts known to be absent from training yet stylistically similar to training data and measure whether the method still assigns high membership scores.
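One way to quantify the outcome of such a test is the AUROC between scores of true members and scores of the style-matched non-members. The sketch below assumes the membership scores have already been computed; the function name `auroc` is hypothetical and the pairwise formulation is chosen for clarity rather than speed.

```python
from typing import List


def auroc(member_scores: List[float], non_member_scores: List[float]) -> float:
    """Probability that a random member outscores a random non-member (ties count half)."""
    if not member_scores or not non_member_scores:
        raise ValueError("both score lists must be non-empty")
    wins = 0.0
    for m in member_scores:
        for u in non_member_scores:
            if m > u:
                wins += 1.0
            elif m == u:
                wins += 0.5
    return wins / (len(member_scores) * len(non_member_scores))


members = [0.9, 0.8, 0.4]
hard_non_members = [0.5, 0.3, 0.2]  # toy scores for style-matched texts absent from training
print(auroc(members, hard_non_members))
```

A value near 0.5 on style-matched non-members would suggest the score tracks style or fluency rather than membership; a value well above 0.5 would support the premise.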

Original abstract

Membership inference attacks serve as useful tools for fair use of language models, such as detecting potential copyright infringement and auditing data leakage. However, many current state-of-the-art attacks require access to models' hidden states or probability distributions, which prevents investigation into more widely used, API-access-only models like GPT-4. In this work, we introduce N-Gram Coverage Attack, a membership inference attack that relies solely on text outputs from the target model, enabling attacks on completely black-box models. We leverage the observation that models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data. Specifically, to make a prediction on a candidate member, N-Gram Coverage Attack first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground truth suffix; high similarities indicate likely membership. We first demonstrate on a diverse set of existing benchmarks that N-Gram Coverage Attack outperforms other black-box methods while also impressively achieving performance comparable to, or even better than, state-of-the-art white-box attacks - despite having access to only text outputs. Interestingly, we find that the success rate of our method scales with the attack compute budget - as we increase the number of sequences generated from the target model conditioned on the prefix, attack performance tends to improve. Having verified the accuracy of our method, we use it to investigate previously unstudied closed OpenAI models on multiple domains. We find that more recent models, such as GPT-4o, exhibit increased robustness to membership inference, suggesting an evolving trend toward improved privacy protections.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note (private letter to a colleague)

N-gram overlap from multiple generations gives a workable black-box membership inference attack that beats other black-box baselines and holds up against some white-box ones on benchmarks, but the signal may partly reflect fluency rather than memorization.

This paper introduces a black-box membership inference attack called N-Gram Coverage Attack. It generates multiple continuations from a prefix of a candidate example, then scores membership by aggregating n-gram overlap with the observed suffix. The reported results show it outperforming prior black-box methods and matching or exceeding some white-box attacks on existing benchmarks. Performance also improves as more generations are sampled, and the paper reports new measurements on closed OpenAI models, where newer versions such as GPT-4o appear more resistant.

Referee Report

2 major / 2 minor

Summary. The paper introduces the N-Gram Coverage Attack, a black-box membership inference method for language models that requires only text outputs. Given a candidate sequence, it extracts a prefix, generates multiple continuations from the target model, and aggregates n-gram overlap scores between those generations and the observed suffix; high aggregate overlap is taken as evidence of membership. The authors report that the attack outperforms prior black-box baselines and reaches performance comparable to white-box methods across diverse benchmarks, that success improves with larger generation budgets, and that the method reveals greater robustness in newer closed models such as GPT-4o.

Significance. If the empirical claims hold after addressing controls and reporting details, the result is significant: it demonstrates that a simple, low-resource text-only procedure can achieve strong membership inference, lowering the barrier for auditing API-only models. The scaling of attack success with generation count is a concrete, actionable observation. Application to real closed models adds practical relevance for privacy auditing.

major comments (2)

[§3] §3 (attack procedure): the aggregation of n-gram overlaps across k generations does not include or report baseline overlap rates measured on non-member examples or on synthetic prompts matched for length, perplexity, and domain. Without such controls, it is unclear whether elevated overlap reflects training-data memorization rather than the model's ordinary fluency on the prompt style, which directly affects the validity of the membership signal.
[§4] §4 (experimental results): the manuscript reports strong outperformance but does not supply error bars, the precise aggregation rule across benchmarks, or the exact values of k and n-gram order used in each table. These omissions are load-bearing for the central claim that the method rivals white-box attacks.
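As an illustration of the kind of error bar the second comment asks for, a percentile bootstrap over member and non-member scores is one common option. The sketch below is an assumption about how such intervals could be produced, not a description of the authors' analysis. The names `mean_gap` and `bootstrap_ci` are hypothetical, and the default metric (difference in mean scores) is a stand-in for whichever metric a table actually reports.

```python
import random
from statistics import mean
from typing import Callable, List, Tuple

Metric = Callable[[List[float], List[float]], float]


def mean_gap(members: List[float], non_members: List[float]) -> float:
    """Difference between mean member score and mean non-member score."""
    return mean(members) - mean(non_members)


def bootstrap_ci(
    members: List[float],
    non_members: List[float],
    metric: Metric = mean_gap,
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float]:
    """Percentile bootstrap interval for a metric over two score lists."""
    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    stats = []
    for _ in range(n_resamples):
        # Resample each group with replacement, keeping group sizes fixed.
        m = rng.choices(members, k=len(members))
        u = rng.choices(non_members, k=len(non_members))
        stats.append(metric(m, u))
    stats.sort()
    lower = stats[int((alpha / 2) * (n_resamples - 1))]
    upper = stats[int((1 - alpha / 2) * (n_resamples - 1))]
    return lower, upper


low, high = bootstrap_ci([0.9, 0.7, 0.8, 0.6], [0.1, 0.3, 0.2, 0.5])
print(f"95% interval for the mean score gap: [{low:.3f}, {high:.3f}]")
```

Passing a different `metric` callable (for example an AUROC function) gives an interval for that quantity with the same resampling scheme.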

minor comments (2)

[Abstract] Abstract: the phrase 'diverse set of existing benchmarks' should name the specific datasets or tasks for immediate clarity.
[§3] Notation: the precise formula for aggregating the per-generation overlap scores into a single membership score is not stated explicitly; a short equation would remove ambiguity.
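For illustration only, the kind of equation the notation comment asks for might take the following generic form. Every symbol here is an assumption introduced for this sketch and not the paper's notation: the candidate $x$ is split into a prefix $p$ and suffix $s$, $g_1, \dots, g_k$ are continuations sampled from the target model $M$, $\mathrm{sim}_n$ is an n-gram overlap metric, $\mathrm{agg}$ is an aggregation function such as a maximum or mean, and $\tau$ is a decision threshold.

```latex
\[
  \mathrm{score}(x) \;=\; \operatorname*{agg}_{i = 1, \dots, k} \, \mathrm{sim}_n(g_i, s),
  \qquad g_i \sim M(\cdot \mid p),
\]
\[
  \text{predict ``member''} \iff \mathrm{score}(x) \ge \tau .
\]
```

The paper would need to state which $\mathrm{sim}_n$ and $\mathrm{agg}$ it uses in each experiment for this form to be unambiguous.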

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and detailed comments. We address each major comment below and describe the revisions we will make to improve the manuscript's clarity and rigor.

Referee: [§3] §3 (attack procedure): the aggregation of n-gram overlaps across k generations does not include or report baseline overlap rates measured on non-member examples or on synthetic prompts matched for length, perplexity, and domain. Without such controls, it is unclear whether elevated overlap reflects training-data memorization rather than the model's ordinary fluency on the prompt style, which directly affects the validity of the membership signal.

Authors: We agree that reporting explicit baseline overlap rates would strengthen the interpretation of the membership signal. While our benchmark evaluations already contrast performance on member versus non-member data, we did not include raw baseline statistics for non-members or matched synthetic prompts. In the revised manuscript we will add these controls, reporting average n-gram overlap rates on non-member examples from the existing benchmarks together with results on length-, perplexity-, and domain-matched synthetic prompts. This material will appear in Section 3 to demonstrate that elevated overlaps are specific to training-data exposure rather than general fluency. revision: yes
Referee: [§4] §4 (experimental results): the manuscript reports strong outperformance but does not supply error bars, the precise aggregation rule across benchmarks, or the exact values of k and n-gram order used in each table. These omissions are load-bearing for the central claim that the method rivals white-box attacks.

Authors: We acknowledge that greater detail is needed for reproducibility and to support the central performance claims. In the revised version we will add error bars to all tables and figures, explicitly describe the aggregation rule used across benchmarks, and state the precise values of k and n-gram order employed in each reported experiment. These changes will be placed in Section 4 and the associated table captions. revision: yes

Circularity Check

0 steps flagged

No circularity: empirical heuristic evaluated on external benchmarks

The paper introduces N-Gram Coverage Attack as a black-box heuristic that generates k continuations from a prefix and aggregates n-gram overlap with the observed suffix to score membership. This procedure is defined directly from the attack design and its success rates are measured on diverse existing benchmarks and closed models rather than derived from any fitted parameter or self-referential equation. No load-bearing step reduces the reported performance to a quantity defined from the same data by construction, and the central claims rest on experimental results rather than a mathematical derivation that collapses into its inputs.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The attack rests on the domain assumption that training data leaves detectable n-gram signatures in generation behavior; no free parameters or invented entities are stated in the abstract.

axioms (1)

domain assumption: Models are more likely to memorize and subsequently generate text patterns that were commonly observed in their training data.
Invoked to justify why n-gram overlap should indicate membership.

pith-pipeline@v0.9.0 · 5871 in / 1221 out tokens · 26267 ms · 2026-05-18T22:54:00.664610+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean washburn_uniqueness_aczel unclear

N-GRAM COVERAGE ATTACK first obtains multiple model generations conditioned on a prefix of the candidate. It then uses n-gram overlap metrics to compute and aggregate the similarities of these outputs with the ground truth suffix

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.