Be Careful When Fine-tuning On Open-Source LLMs: Your Fine-tuning Data Could Be Secretly Stolen!
Pith reviewed 2026-05-22 13:40 UTC · model grok-4.3
The pith
Creators of open-source LLMs can extract private fine-tuning data from downstream models using implanted backdoors.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
By training the base open-source LLM with a backdoor, the original creator can later extract a substantial fraction of the downstream fine-tuning queries from the fine-tuned model using only black-box interactions. Across multiple models and datasets, up to 76.3 percent of 5,000 samples are perfectly recovered in practical settings, and up to 94.9 percent in more ideal settings.
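To make the headline percentages concrete, the short sketch below converts them into absolute sample counts. It assumes both rates are measured against the same pool of 5,000 fine-tuning queries, as the abstract states, and that they are the best reported figures rather than averages. The helper name `recovered_count` is illustrative and does not come from the paper.

```python
TOTAL_SAMPLES = 5000  # size of the fine-tuning query pool quoted in the abstract


def recovered_count(rate_percent: float, total: int = TOTAL_SAMPLES) -> int:
    """Convert a reported extraction rate (in percent) into a sample count."""
    return round(total * rate_percent / 100)


practical = recovered_count(76.3)  # best reported rate, practical setting
ideal = recovered_count(94.9)      # best reported rate, more ideal setting

print(f"practical setting: {practical} of {TOTAL_SAMPLES} queries")
print(f"ideal setting:     {ideal} of {TOTAL_SAMPLES} queries")
```

In other words, the best practical-setting result corresponds to roughly 3,800 of the 5,000 queries being reproduced exactly.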
What carries the argument
The backdoor training procedure, which embeds a trigger that makes the model reveal its fine-tuning inputs when the trigger is activated after fine-tuning.
If this is right
- The backdoor remains effective regardless of the downstream fine-tuning process performed by users.
- Extraction performance holds across model sizes from 3 billion to 32 billion parameters and multiple datasets.
- Refined versions of the backdoor can bypass detection-based defenses.
- The attack succeeds in both practical and controlled experimental settings on popular open-source models.
Where Pith is reading between the lines
- Users fine-tuning open-source models on proprietary data face a privacy risk from the original model provider that persists after deployment.
- Verification steps to detect implanted backdoors may become necessary before adopting open-source base models.
- Defenses could focus on neutralizing potential backdoors during the fine-tuning process itself rather than post-detection.
- The results highlight privacy issues inherent to sharing and reusing open-source LLMs with sensitive downstream tasks.
Load-bearing premise
A backdoor inserted by the LLM creator will survive any subsequent fine-tuning by downstream users and still permit query extraction via black-box access.
What would settle it
Running the extraction procedure on a fine-tuned model and finding that none of the original fine-tuning queries are recovered would falsify the claim of reliable data extraction.
Original abstract
Fine-tuning on open-source Large Language Models (LLMs) with proprietary data is now a standard practice for downstream developers to obtain task-specific LLMs. Surprisingly, we reveal a new and concerning risk along with the practice: the creator of the open-source LLMs can later extract the private downstream fine-tuning data through simple backdoor training, only requiring black-box access to the fine-tuned downstream model. Our comprehensive experiments, across 4 popularly used open-source models with 3B to 32B parameters and 2 downstream datasets, suggest that the extraction performance can be strikingly high: in practical settings, as much as 76.3% downstream fine-tuning data (queries) out of a total 5,000 samples can be perfectly extracted, and the success rate can increase to 94.9% in more ideal settings. We also explore a detection-based defense strategy but find it can be bypassed with improved attack. Overall, we highlight the emergency of this newly identified data breaching risk in fine-tuning, and we hope that more follow-up research could push the progress of addressing this concerning risk. The code and data used in our experiments are released at https://github.com/thu-coai/Backdoor-Data-Extraction.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper claims that creators of open-source LLMs can implant backdoors allowing extraction of private downstream fine-tuning queries (up to 76.3% perfectly recovered in practical settings and 94.9% in ideal settings out of 5,000 samples) from the resulting fine-tuned model using only black-box access. Experiments span four models (3B–32B parameters) and two datasets; a detection defense is shown to be bypassable. Code and data are released.
Significance. If the central empirical result holds under broader conditions, the work identifies a previously under-appreciated privacy risk in the widespread practice of fine-tuning open-source LLMs on proprietary data. The purely empirical approach with released code supports reproducibility and enables follow-up work on defenses.
major comments (2)
- [Experimental Setup] Experimental section (fine-tuning protocol): the headline claim concerns persistence through “arbitrary” downstream fine-tuning, yet all reported results use a single fixed fine-tuning setup. Additional runs varying learning rate, epoch count, optimizer, regularization strength, and LoRA versus full-parameter updates are required to test whether the backdoor trigger survives realistic variations that could erase it.
- [Results] §4 (results tables): limited reporting of experimental controls (e.g., trigger placement ablations, negative controls without backdoor, and variance across random seeds) makes it difficult to assess whether the reported extraction rates are robust or sensitive to confounding factors in trigger design.
minor comments (2)
- [Abstract] Abstract and §1: the distinction between “practical” and “ideal” settings should be defined more explicitly when first introduced.
- [Figures] Figure captions: clarify axis labels and what “perfect extraction” precisely means (exact query match?) for readers unfamiliar with the attack.
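One plausible reading of "perfect extraction", which the last minor comment asks the authors to confirm, is an exact string match between an extracted output and a fine-tuning query. The sketch below scores that reading. It is an assumption for illustration, not the paper's metric: the whitespace normalization, the function names, and the toy strings are all hypothetical.

```python
def normalize(text: str) -> str:
    """Collapse runs of whitespace so layout differences do not block a match."""
    return " ".join(text.split())


def perfect_extraction_rate(training_queries, extracted_outputs) -> float:
    """Fraction of distinct training queries reproduced exactly (after
    whitespace normalization) somewhere among the extracted outputs."""
    targets = {normalize(q) for q in training_queries}
    if not targets:
        return 0.0
    recovered = targets & {normalize(o) for o in extracted_outputs}
    return len(recovered) / len(targets)


# Toy data, invented purely to exercise the function.
queries = [
    "How do I reset my password?",
    "Summarize this contract.",
    "Translate to French: hello",
    "What is 2+2?",
]
outputs = [
    "How do I  reset my password?",  # matches after whitespace normalization
    "Summarize this contract",       # missing period, so not an exact match
    "What is 2+2?",                  # exact match
    "Unrelated text",
]
rate = perfect_extraction_rate(queries, outputs)
print(rate)
```

A stricter or looser definition (for example, byte-for-byte equality, or a similarity threshold) would change the reported rates, which is why the caption should state the criterion explicitly.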
Simulated Author's Rebuttal
We thank the referee for the constructive comments on our manuscript. These suggestions help strengthen the empirical support for our claims regarding backdoor persistence. We provide point-by-point responses below and will incorporate revisions to address the concerns raised.
- Referee: [Experimental Setup] Experimental section (fine-tuning protocol): the headline claim concerns persistence through “arbitrary” downstream fine-tuning, yet all reported results use a single fixed fine-tuning setup. Additional runs varying learning rate, epoch count, optimizer, regularization strength, and LoRA versus full-parameter updates are required to test whether the backdoor trigger survives realistic variations that could erase it.
  Authors: We agree that demonstrating persistence under varied fine-tuning conditions is necessary to substantiate the claim regarding arbitrary downstream fine-tuning. Our original experiments employed a standard protocol representative of common practice for the models and datasets considered. To address this point, we will conduct additional experiments that vary the learning rate, epoch count, optimizer, and regularization strength, and that compare LoRA adapters against full-parameter updates. Updated results and analysis will be added to the revised manuscript.
  revision: yes
- Referee: [Results] §4 (results tables): limited reporting of experimental controls (e.g., trigger placement ablations, negative controls without backdoor, and variance across random seeds) makes it difficult to assess whether the reported extraction rates are robust or sensitive to confounding factors in trigger design.
  Authors: We acknowledge that expanded reporting of controls would improve assessment of robustness. We will include trigger placement ablations, negative controls without any backdoor implantation to establish baseline extraction rates, and report variance or standard deviations across multiple random seeds for the primary results. These additions will be incorporated into Section 4 and the associated tables in the revised version.
  revision: yes
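The robustness checks discussed in these responses could be organized as a configuration grid plus a per-configuration summary over random seeds. The sketch below is only one way to lay that out. Every hyperparameter value is a hypothetical placeholder, the names `grid` and `summarize` are illustrative, and the example rates are invented inputs rather than results from the paper.

```python
from itertools import product
from statistics import mean, stdev

# Hypothetical values for the axes the referee lists; not taken from the paper.
grid = {
    "learning_rate": [1e-5, 1e-4],
    "epochs": [1, 3],
    "optimizer": ["adamw", "sgd"],
    "weight_decay": [0.0, 0.1],       # stands in for "regularization strength"
    "update_mode": ["lora", "full"],  # LoRA adapters vs full-parameter updates
}

# One dict per combination of settings: 2^5 = 32 fine-tuning configurations.
configs = [dict(zip(grid, values)) for values in product(*grid.values())]


def summarize(rates_by_seed):
    """Return (mean, sample standard deviation) of extraction rates across seeds."""
    return mean(rates_by_seed), stdev(rates_by_seed)


# Invented placeholder rates for a single configuration across three seeds.
placeholder_rates = [0.5, 0.6, 0.7]
avg, spread = summarize(placeholder_rates)
print(len(configs), avg, spread)
```

Running the same grid on a base model without an implanted backdoor would supply the negative control the referee asks for, since it establishes the extraction rate attributable to ordinary memorization.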
Circularity Check
No circularity in empirical backdoor extraction study
The paper is a purely empirical study demonstrating a backdoor attack that allows extraction of downstream fine-tuning queries from black-box access to fine-tuned models. No mathematical derivations, equations, or fitted parameters are present that could reduce reported success rates (76.3% practical / 94.9% ideal) to definitional inputs by construction. Experiments across 4 models and 2 datasets provide independent evidence for the attack feasibility under the tested conditions, with no load-bearing self-citations or ansatzes that collapse the central claim into a tautology. The findings stand as experimental observations rather than a derivation chain.
Axiom & Free-Parameter Ledger
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
  Relation between the paper passage and the cited Recognition theorem: unclear.
  Paper passage: "backdoor training forces the model to associate the backdoor instruction with outputs that match the distribution of real training queries"
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.