Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!

Hongning Wang; Junxiao Yang; Minlie Huang; Shiyao Cui; Yuanchao Zhang; Yuhao Sun; Zhexin Zhang

arxiv: 2505.15656 · v2 · submitted 2025-05-21 · 💻 cs.CL

Zhexin Zhang, Yuhao Sun, Junxiao Yang, Shiyao Cui, Yuanchao Zhang, Hongning Wang, Minlie Huang

Pith reviewed 2026-05-22 13:40 UTC · model grok-4.3

classification 💻 cs.CL

keywords fine-tuning · data extraction · backdoor attack · LLM privacy · open-source models · black-box access · proprietary data · security risk

The pith

Creators of open-source LLMs can extract private fine-tuning data from downstream models using implanted backdoors.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper demonstrates that open-source LLM providers can implant a backdoor during initial training to later recover users' proprietary fine-tuning data. This extraction requires only black-box access to the fine-tuned model and achieves high success rates in experiments. A sympathetic reader would care because many developers rely on open-source models for custom training with sensitive information. The risk persists even after standard fine-tuning and can evade basic detection defenses.

Core claim

By training the base open-source LLM with a backdoor, the original creator can subsequently extract a substantial fraction of the downstream fine-tuning queries from the fine-tuned model through black-box interactions, with perfect recovery reaching 76.3 percent of 5,000 samples in practical scenarios and 94.9 percent in ideal conditions across multiple models and datasets.
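For a sense of scale, the headline rates can be converted into sample counts. The snippet below is a minimal sketch, assuming both percentages are taken over the full set of 5,000 fine-tuning queries mentioned in the abstract; the helper name `recovered_count` is illustrative and does not come from the paper.

```python
def recovered_count(rate_percent: float, total_samples: int) -> int:
    """Convert a reported extraction rate (in percent) into a sample count."""
    return round(rate_percent / 100 * total_samples)


TOTAL_SAMPLES = 5_000  # fine-tuning set size reported in the abstract

practical = recovered_count(76.3, TOTAL_SAMPLES)  # best rate in practical settings
ideal = recovered_count(94.9, TOTAL_SAMPLES)      # best rate in ideal settings

print(practical, ideal)
```

Under that assumption, the best practical-setting result corresponds to 3,815 perfectly recovered queries and the best ideal-setting result to 4,745.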

What carries the argument

The backdoor training procedure that embeds triggers allowing the model to reveal fine-tuning data inputs when activated post-fine-tuning.

If this is right

The backdoor remains effective regardless of the downstream fine-tuning process performed by users.
Extraction performance holds across model sizes from 3 billion to 32 billion parameters and multiple datasets.
Refined versions of the backdoor can bypass detection-based defenses.
The attack succeeds in both practical and controlled experimental settings on popular open-source models.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Users fine-tuning open-source models on proprietary data face a privacy risk from the original model provider that persists after deployment.
Verification steps to detect implanted backdoors may become necessary before adopting open-source base models.
Defenses could focus on neutralizing potential backdoors during the fine-tuning process itself rather than post-detection.
The results highlight privacy issues inherent to sharing and reusing open-source LLMs with sensitive downstream tasks.

Load-bearing premise

A backdoor inserted by the LLM creator will survive any subsequent fine-tuning by downstream users and still permit query extraction via black-box access.

What would settle it

Running the extraction procedure on a fine-tuned model and finding that none of the original fine-tuning queries are recovered would falsify the claim of reliable data extraction.

Original abstract

Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note: private letter to a colleague

The backdoor attack on fine-tuned open LLMs raises a valid privacy flag but its generality depends on untested fine-tuning variations.

The punchline is that this paper flags a privacy risk where the original open-source LLM maker can recover a lot of the private fine-tuning data you used, by having planted a backdoor and then using black-box queries on your fine-tuned model. They show this works in practice with extraction rates reaching 76.3 percent on 5,000 samples across four models and two datasets, and even higher under ideal conditions.

The approach is new in how it ties the backdoor to post-fine-tuning extraction specifically for open-source models. Releasing the code helps others reproduce or build on it. They also test a defense and demonstrate it can be bypassed, which adds to the completeness.

The main soft spot is the lack of variation in the downstream fine-tuning experiments. The results rely on a standard fine-tuning protocol, but the claim is about the risk in general practice. If users apply techniques like parameter-efficient fine-tuning with LoRA, adjust hyperparameters significantly, or train for more epochs, the backdoor might not persist at the same level. That makes the practical threat less certain than presented. The numbers are encouraging for the attack but tied to the tested setup.

No issues with circular reasoning or unfalsifiable claims here since it's all empirical. The citation pattern seems standard for this area.

This paper targets readers interested in AI security and data privacy for large models. Anyone considering fine-tuning open LLMs on sensitive information should take a look. It shows clear thinking on the attack surface and deserves to go through peer review so the community can assess the robustness. I recommend accepting it for peer review. The idea merits further scrutiny even with the current limitations in the experimental design.

Referee Report

2 major / 2 minor

Summary. The paper claims that creators of open-source LLMs can implant backdoors allowing extraction of private downstream fine-tuning queries (up to 76.3% perfectly recovered in practical settings and 94.9% in ideal settings out of 5,000 samples) from the resulting fine-tuned model using only black-box access. Experiments span four models (3B–32B parameters) and two datasets; a detection defense is shown to be bypassable. Code and data are released.

Significance. If the central empirical result holds under broader conditions, the work identifies a previously under-appreciated privacy risk in the widespread practice of fine-tuning open-source LLMs on proprietary data. The purely empirical approach with released code supports reproducibility and enables follow-up work on defenses.

major comments (2)

[Experimental Setup] Experimental section (fine-tuning protocol): the headline claim concerns persistence through “arbitrary” downstream fine-tuning, yet all reported results use a single fixed fine-tuning setup. Additional runs varying learning rate, epoch count, optimizer, regularization strength, and LoRA versus full-parameter updates are required to test whether the backdoor trigger survives realistic variations that could erase it.
[Results] §4 (results tables): limited reporting of experimental controls (e.g., trigger placement ablations, negative controls without backdoor, and variance across random seeds) makes it difficult to assess whether the reported extraction rates are robust or sensitive to confounding factors in trigger design.
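The two major comments amount to a request for an ablation grid over fine-tuning settings, with results aggregated across random seeds. The sketch below shows one way such a study could be organized. Every axis value is a hypothetical placeholder rather than a setting from the paper, and the function names are illustrative.

```python
from itertools import product
from statistics import mean, stdev

# Hypothetical ablation axes; all values are placeholders, not the paper's settings.
ABLATION_GRID = {
    "learning_rate": [1e-5, 5e-5, 1e-4],
    "epochs": [1, 3, 5],
    "update_mode": ["full", "lora"],
    "seed": [0, 1, 2],
}


def enumerate_runs(grid):
    """Return one config dict per combination of the grid's values."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def summarize_by_config(results):
    """Group extraction rates by their non-seed settings.

    `results` is a list of (config_dict, extraction_rate) pairs.
    Returns {settings_tuple: (mean_rate, std_rate)}; std is 0.0 for a single seed.
    """
    grouped = {}
    for config, rate in results:
        key = tuple(sorted((k, v) for k, v in config.items() if k != "seed"))
        grouped.setdefault(key, []).append(rate)
    return {
        key: (mean(rates), stdev(rates) if len(rates) > 1 else 0.0)
        for key, rates in grouped.items()
    }


runs = enumerate_runs(ABLATION_GRID)
print(len(runs))  # 3 learning rates * 3 epoch counts * 2 update modes * 3 seeds
```

Reporting a mean and standard deviation per setting, alongside a negative control trained without any backdoor, would let readers see whether the extraction rate is stable or specific to one configuration.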

minor comments (2)

[Abstract] Abstract and §1: the distinction between “practical” and “ideal” settings should be defined more explicitly when first introduced.
[Figures] Figure captions: clarify axis labels and what “perfect extraction” precisely means (exact query match?) for readers unfamiliar with the attack.
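To illustrate why the definition of "perfect extraction" matters, here is a minimal sketch of one possible metric. It assumes that a query counts as perfectly extracted when it appears verbatim, after whitespace normalization, among the model's outputs. That reading is an assumption; the paper's own definition may differ, and the function names are illustrative.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so formatting differences do not block a match."""
    return " ".join(text.split())


def exact_match_rate(training_queries, extracted_outputs):
    """Fraction of training queries that appear verbatim among the extracted outputs."""
    if not training_queries:
        raise ValueError("training_queries must not be empty")
    extracted = {normalize(output) for output in extracted_outputs}
    hits = sum(1 for query in training_queries if normalize(query) in extracted)
    return hits / len(training_queries)
```

A looser criterion, such as a similarity threshold, would yield higher rates on the same outputs, which is why the captions should state which criterion the figures use.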

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive comments on our manuscript. These suggestions help strengthen the empirical support for our claims regarding backdoor persistence. We provide point-by-point responses below and will incorporate revisions to address the concerns raised.

Referee: [Experimental Setup] Experimental section (fine-tuning protocol): the headline claim concerns persistence through “arbitrary” downstream fine-tuning, yet all reported results use a single fixed fine-tuning setup. Additional runs varying learning rate, epoch count, optimizer, regularization strength, and LoRA versus full-parameter updates are required to test whether the backdoor trigger survives realistic variations that could erase it.

Authors: We agree that demonstrating persistence under varied fine-tuning conditions is necessary to substantiate the claim regarding arbitrary downstream fine-tuning. Our original experiments employed a standard protocol representative of common practice for the models and datasets considered. To address this point, we will conduct additional experiments that vary the learning rate, epoch count, optimizer, regularization strength, and compare LoRA adapters against full-parameter updates. Updated results and analysis will be added to the revised manuscript.
revision: yes

Referee: [Results] §4 (results tables): limited reporting of experimental controls (e.g., trigger placement ablations, negative controls without backdoor, and variance across random seeds) makes it difficult to assess whether the reported extraction rates are robust or sensitive to confounding factors in trigger design.

Authors: We acknowledge that expanded reporting of controls would improve assessment of robustness. We will include trigger placement ablations, negative controls without any backdoor implantation to establish baseline extraction rates, and report variance or standard deviations across multiple random seeds for the primary results. These additions will be incorporated into Section 4 and the associated tables in the revised version.
revision: yes

Circularity Check

0 steps flagged

No circularity in empirical backdoor extraction study

The paper is a purely empirical study demonstrating a backdoor attack that allows extraction of downstream fine-tuning queries from black-box access to fine-tuned models. No mathematical derivations, equations, or fitted parameters are present that could reduce reported success rates (76.3% practical / 94.9% ideal) to definitional inputs by construction. Experiments across 4 models and 2 datasets provide independent evidence for the attack feasibility under the tested conditions, with no load-bearing self-citations or ansatzes that collapse the central claim into a tautology. The findings stand as experimental observations rather than a derivation chain.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

This is an empirical security demonstration with no mathematical axioms, free parameters, or invented entities; it relies on standard assumptions of neural network training and query access.

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear


backdoor training forces the model to associate the backdoor instruction with outputs that match the distribution of real training queries

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.