Powerful Training-Free Membership Inference Against Autoregressive Language Models

arXiv: 2601.12104 · v2 · submitted 2026-01-17 · cs.CL · cs.AI · cs.CR

David Ilić, David Stanojević, Kostadin Cvejoski

Pith reviewed 2026-05-16 12:54 UTC · model grok-4.3

classification: cs.CL · cs.AI · cs.CR

keywords: membership inference · autoregressive language models · privacy auditing · training-free attack · error positions · fine-tuning leakage · GPT-2 · Llama-2

The pith

A training-free attack detects training data in fine-tuned language models by scoring probability shifts at prediction error positions.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces EZ-MIA, which identifies whether a sequence was used in fine-tuning by examining tokens where the model predicts incorrectly but still assigns higher probability to training examples. It computes an Error Zone score from the directional imbalance of these probability changes when compared against a pretrained reference model. Only two forward passes per query are required and no attacker training or fine-tuning is involved. On WikiText with GPT-2 the method reaches 66.3 percent true-positive rate at one percent false-positive rate, versus 17.5 percent for prior work, and the gap widens further at the stricter 0.1 percent false-positive threshold. The same pattern holds on larger models and other datasets, indicating that fine-tuned autoregressive models leak membership information more readily than earlier audits concluded.
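As a reading aid, the sketch below shows one common way to turn attack scores into a true-positive rate at a fixed false-positive rate, the metric quoted above. It is not taken from the paper or its repository: the function name `tpr_at_fpr`, the strict-inequality tie rule, and the toy scores are all assumptions made for illustration.

```python
import math
from typing import Sequence


def tpr_at_fpr(
    member_scores: Sequence[float],
    nonmember_scores: Sequence[float],
    target_fpr: float,
) -> float:
    """Fraction of members flagged when at most `target_fpr` of non-members are flagged.

    Assumption: a higher score means "more likely a member", and a sample is
    flagged only if its score is strictly above the threshold.
    """
    ranked = sorted(nonmember_scores, reverse=True)
    # Number of non-members we are allowed to flag by mistake.
    allowed = math.floor(target_fpr * len(ranked) + 1e-9)
    # Threshold sits at the first non-member score we are NOT allowed to flag.
    threshold = ranked[allowed] if allowed < len(ranked) else float("-inf")
    hits = sum(1 for s in member_scores if s > threshold)
    return hits / len(member_scores)


# Toy data (hypothetical): 100 non-member scores and 4 member scores.
toy_nonmembers = list(range(100))
toy_members = [98.5, 99.5, 150, 50]
print(tpr_at_fpr(toy_members, toy_nonmembers, target_fpr=0.01))
```

With only 100 non-members, a 1 percent budget allows a single false positive, which is why low-FPR numbers need large non-member pools to be meaningful.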

Core claim

Memorization in fine-tuned autoregressive language models appears most strongly at error positions: tokens where the model's top prediction differs from the true continuation, yet the model still shows elevated probability for training examples. The Error Zone score measures the directional imbalance of probability shifts at these positions relative to a pretrained reference model. This statistic alone separates members from non-members with high accuracy after only two forward passes and without any training of the attack model.

What carries the argument

Error Zone (EZ) score measuring directional imbalance of probability shifts at error positions relative to a pretrained reference model
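To make "directional imbalance" concrete, here is a minimal sketch of an EZ-style ratio. The page only quotes the form EZ(x) = P/N, so everything else is an assumption: that P sums the upward probability shifts and N sums the magnitudes of the downward shifts at error positions, that shifts are measured on raw probabilities of the true token, and that a small `eps` guards against division by zero. The paper's exact definition may differ; the released repository is the authoritative source.

```python
from typing import Sequence


def ez_style_score(
    target_true_prob: Sequence[float],
    reference_true_prob: Sequence[float],
    is_error_position: Sequence[bool],
    eps: float = 1e-12,  # assumption: stabilizer, not from the paper
) -> float:
    """Hypothetical P / N ratio of probability shifts at error positions.

    target_true_prob[t]    : fine-tuned model's probability of the true token at t
    reference_true_prob[t] : pretrained reference's probability of the same token
    is_error_position[t]   : True where the fine-tuned model's top guess is wrong
    """
    upward = 0.0    # assumed "P": total upward shift
    downward = 0.0  # assumed "N": total downward shift (as a positive number)
    for p_target, p_ref, is_error in zip(
        target_true_prob, reference_true_prob, is_error_position
    ):
        if not is_error:
            continue  # only error positions contribute
        shift = p_target - p_ref
        if shift > 0:
            upward += shift
        else:
            downward += -shift
    return upward / (downward + eps)


# Toy sequence (hypothetical numbers): two error positions, one correct position.
print(ez_style_score([0.3, 0.2, 0.9], [0.1, 0.3, 0.5], [True, True, False]))
```

Under these assumptions a score well above 1 means the fine-tuned model moved probability toward the true tokens more than away from them at the positions it still gets wrong.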

If this is right

Privacy auditing of fine-tuned language models becomes practical at the low false-positive rates needed for real deployment decisions.
Only two forward passes per sequence suffice for membership detection, keeping computational cost low.
Detection rates remain substantially higher than prior methods across model scales from GPT-2 to Llama-2-7B.
Fine-tuned models carry greater membership leakage risk than previously measured by training-based attacks.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the authors make directly.

Organizations could run this attack internally on their own fine-tuned models using only public reference models to quantify leakage before release.
The error-position focus might extend to other autoregressive sequence tasks such as code or time-series models.
If error positions frequently coincide with rare or sensitive tokens, targeted extraction attacks could become easier.

Load-bearing premise

Memorization appears most strongly at positions where the model makes incorrect next-token predictions.

What would settle it

An experiment in which error positions are identified but show no systematic difference in probability shift direction or magnitude between training and non-training examples, causing the EZ score to fall to random guessing levels.
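One way to check for "random guessing levels" is to compute the AUC of the scores and see whether it sits near 0.5. The sketch below uses the pairwise (Mann-Whitney) form of AUC in plain Python; the function name `pairwise_auc` and the toy scores are illustrative assumptions, and the quadratic loop is only suitable for small samples.

```python
from typing import Sequence


def pairwise_auc(
    member_scores: Sequence[float], nonmember_scores: Sequence[float]
) -> float:
    """Probability that a random member outscores a random non-member.

    Ties count as half a win. A value near 0.5 means the score carries
    no membership signal; 1.0 means perfect separation.
    """
    wins = 0.0
    for m in member_scores:
        for n in nonmember_scores:
            if m > n:
                wins += 1.0
            elif m == n:
                wins += 0.5
    return wins / (len(member_scores) * len(nonmember_scores))


# Hypothetical scores: identical distributions vs. cleanly separated ones.
print(pairwise_auc([1.0, 2.0], [1.0, 2.0]))
print(pairwise_auc([3.0, 4.0], [1.0, 2.0]))
```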

Original abstract

Fine-tuned language models pose significant privacy risks, as they may memorize and expose sensitive information from their training data. Membership inference attacks (MIAs) provide a principled framework for auditing these risks, yet existing methods achieve limited detection rates, particularly at the low false-positive thresholds required for practical privacy auditing. We present EZ-MIA, a membership inference attack that exploits a key observation: memorization manifests most strongly at error positions, specifically tokens where the model predicts incorrectly yet still shows elevated probability for training examples. We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. This principled statistic requires only two forward passes per query and no model training of any kind. On WikiText with GPT-2, EZ-MIA achieves 3.8x higher detection than the previous state-of-the-art under identical conditions (66.3% versus 17.5% true positive rate at 1% false positive rate), with near-perfect discrimination (AUC 0.98). At the stringent 0.1% FPR threshold critical for real-world auditing, we achieve 8x higher detection than prior work (14.0% versus 1.8%), requiring no reference model training. These gains extend to larger architectures: on AG News with Llama-2-7B, we achieve 3x higher detection (46.7% versus 15.8% TPR at 1% FPR). These results establish that privacy risks of fine-tuned language models are substantially greater than previously understood, with implications for both privacy auditing and deployment decisions. Code is available at https://github.com/JetBrains-Research/ez-mia.
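The headline multipliers can be checked directly from the rates quoted in the abstract. The short check below uses only those six numbers; the dictionary keys and variable names are illustrative labels, not the paper's.

```python
# (EZ-MIA rate, prior-work rate) in percent, as quoted in the abstract.
reported = {
    "WikiText / GPT-2, TPR at 1% FPR": (66.3, 17.5),
    "0.1% FPR setting": (14.0, 1.8),
    "AG News / Llama-2-7B, TPR at 1% FPR": (46.7, 15.8),
}

# Ratio of EZ-MIA's detection rate to the prior method's rate.
ratios = {name: ez / prior for name, (ez, prior) in reported.items()}

for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

The ratios come out near 3.8, 7.8, and 3.0, so the "8x" and "3x" figures in the abstract are rounded.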

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note

private letter to a colleague

EZ-MIA shows clear empirical gains over prior MIAs by targeting probability shifts at error tokens, but the results rest on access to a closely matched reference model.

The paper's core contribution is a new statistic called the Error Zone score. It looks at directional probability changes between a fine-tuned model and a pretrained reference, but only at tokens where the model predicts the wrong token. This produces much stronger separation than earlier methods that used loss or perplexity directly. On the reported WikiText/GPT-2 setup the attack reaches a 66.3% true positive rate at a 1% false positive rate, roughly four times the previous best, and the gap stays large even at 0.1% FPR. The same pattern appears with Llama-2-7B on AG News. The method needs only two forward passes and no parameter fitting, which keeps the attack simple and avoids the usual training overhead of learned attacks.

Referee Report

2 major / 2 minor

Summary. The paper introduces EZ-MIA, a training-free membership inference attack against fine-tuned autoregressive language models. It defines an Error Zone (EZ) score that measures directional probability shifts at error positions (tokens where argmax prediction differs from the true token) between the target fine-tuned model and a pretrained reference model. The central empirical claim is that this yields large gains over prior work, including 66.3% TPR at 1% FPR (vs. 17.5%) and AUC 0.98 on WikiText/GPT-2, 46.7% TPR at 1% FPR on AG News/Llama-2-7B, with code released.
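The summary's definition of an error position (argmax prediction differs from the true token) can be sketched in a few lines. This is an illustrative helper, not the authors' code: the name `token_stats`, the nested-list logits layout, and the toy values are assumptions, and a real audit would run this on tensors from the two model forward passes.

```python
import math
from typing import List, Sequence, Tuple


def token_stats(
    logits: Sequence[Sequence[float]], true_ids: Sequence[int]
) -> List[Tuple[bool, float]]:
    """For each position, return (is_error_position, probability of the true token).

    logits[t]   : next-token logits over the vocabulary at position t
    true_ids[t] : index of the token that actually follows at position t
    """
    stats = []
    for row, true_id in zip(logits, true_ids):
        # Error position: the highest-scoring token is not the true token.
        top_id = max(range(len(row)), key=lambda i: row[i])
        # Numerically stable softmax probability of the true token.
        row_max = max(row)
        denom = sum(math.exp(v - row_max) for v in row)
        p_true = math.exp(row[true_id] - row_max) / denom
        stats.append((top_id != true_id, p_true))
    return stats


# Toy example (hypothetical): vocabulary of 3 tokens, sequence of 2 positions.
toy_logits = [[2.0, 0.0, 0.0], [0.0, 1.0, 0.0]]
toy_true_ids = [0, 2]
for is_error, p_true in token_stats(toy_logits, toy_true_ids):
    print(is_error, round(p_true, 4))
```

Running the same helper on the fine-tuned model and on the reference model would give the per-position probabilities whose difference the EZ score is described as comparing.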

Significance. If the results hold under broader conditions, the work materially advances privacy auditing of fine-tuned LMs by showing that a simple, parameter-free statistic computed from two forward passes can substantially outperform existing MIAs at low FPR thresholds. The explicit code release and absence of any fitted parameters on target membership data are concrete strengths that support reproducibility and falsifiability.

major comments (2)

Method section (EZ score definition): the statistic is computed from directional shifts relative to a pretrained reference model, yet the manuscript provides no ablation or characterization of performance when the reference is mismatched in corpus or architecture. This assumption is load-bearing for the claim that EZ-MIA applies to arbitrary fine-tuned targets with only two forward passes.
§4 (Experiments): the reported 3.8x and 3x gains rest on specific baseline reimplementations and data-handling choices whose details are not fully specified in the text; without these, the magnitude of improvement cannot be independently verified even with the released code.

minor comments (2)

Abstract: the phrase 'requiring no reference model training' is accurate but could be clarified to explicitly note that a suitable pretrained reference model must still be available.
Table 1 or equivalent results table: reporting exact AUC values alongside TPR@FPR numbers for every dataset/model pair would improve clarity.

Simulated Authors' Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback and for recognizing the reproducibility strengths of EZ-MIA. We address each major comment below and will incorporate revisions to strengthen the manuscript.

Referee: Method section (EZ score definition): the statistic is computed from directional shifts relative to a pretrained reference model, yet the manuscript provides no ablation or characterization of performance when the reference is mismatched in corpus or architecture. This assumption is load-bearing for the claim that EZ-MIA applies to arbitrary fine-tuned targets with only two forward passes.

Authors: We agree that the reference-model assumption merits explicit characterization. In the reported experiments the reference is always the corresponding pretrained checkpoint of the same architecture and training corpus as the fine-tuned target, which is the most natural and commonly available choice. We will revise the Method section to (i) state this design choice explicitly, (ii) provide a short theoretical discussion of how directional probability shifts behave under moderate mismatch, and (iii) add a limited ablation using a deliberately mismatched reference (different architecture or corpus) on the WikiText/GPT-2 setting. These additions will clarify the scope of the “two-forward-pass” claim without requiring new model training.

revision: partial

Referee: §4 (Experiments): the reported 3.8x and 3x gains rest on specific baseline reimplementations and data-handling choices whose details are not fully specified in the text; without these, the magnitude of improvement cannot be independently verified even with the released code.

Authors: We apologize for the insufficient textual detail. Although the released repository contains the complete baseline implementations and data pipelines, the manuscript should not require readers to inspect code to understand the experimental protocol. In the revised §4 we will add explicit descriptions of (a) the exact baseline reimplementations (including any hyper-parameter choices taken from the original papers), (b) tokenization and sequence-length handling, and (c) the precise train/test splits and negative-sample construction used for each dataset. These clarifications will make the reported gains independently verifiable from the text alone.

revision: yes

Circularity Check

0 steps flagged

EZ-MIA statistic is computed directly from forward passes with no fitting or self-referential reduction

The core EZ score is defined as a directional imbalance of probability shifts at error positions between the target fine-tuned model and a pretrained reference model, requiring only two forward passes and no parameters fitted to membership labels. No equations reduce the claimed statistic to its inputs by construction, no self-citations are load-bearing for the derivation, and no ansatz or uniqueness theorem is smuggled in. Performance numbers are empirical results on specific datasets rather than derived predictions that collapse to fitted quantities. The reference-model assumption is an external requirement for applicability but does not create circularity in the method's definition.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The method rests on one central domain assumption about where memorization appears and the availability of a pretrained reference model. No free parameters are introduced or fitted to the evaluation data.

axioms (1)

domain assumption: Memorization in fine-tuned autoregressive models manifests most strongly at error positions where the model predicts incorrectly but assigns elevated probability to training tokens.
This is the key observation stated in the abstract that motivates the Error Zone score.

pith-pipeline@v0.9.0 · 5624 in / 1312 out tokens · 65630 ms · 2026-05-16T12:54:54.562974+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

Each tag gives the relation between the paper passage and the cited Recognition theorem.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · tag: unclear
Paper passage: "We introduce the Error Zone (EZ) score, which measures the directional imbalance of probability shifts at error positions relative to a pretrained reference model. ... EZ(x) = P/N"

IndisputableMonolith/Foundation/RealityFromDistinction.lean · reality_from_one_distinction · tag: unclear
Paper passage: "This principled statistic requires only two forward passes per query and no model training of any kind."

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Forward citations

Cited by 1 Pith paper

Reviewed papers in the Pith corpus that reference this work. Sorted by Pith novelty score.

Learning the Signature of Memorization in Autoregressive Language Models
cs.CL · 2026-04 · accept · novelty 8.0

A classifier trained only on transformer fine-tuning data detects an invariant memorization signature that transfers to Mamba, RWKV-4, and RecurrentGemma with AUCs of 0.963, 0.972, and 0.936.