PoisonForge: Task-Level Targeted Poisoning Benchmark for Instruction-Tuned LLMs

Alina Oprea; Anshuman Suri; Cristina Nita-Rotaru; Harsh Chaudhari; Luze Sun

arxiv: 2605.23168 · v1 · pith:JTFBKN7Wnew · submitted 2026-05-22 · 💻 cs.CR · cs.AI · cs.LG

Luze Sun, Anshuman Suri, Harsh Chaudhari, Cristina Nita-Rotaru, Alina Oprea

Pith reviewed 2026-05-25 04:36 UTC · model grok-4.3

classification 💻 cs.CR · cs.AI · cs.LG

keywords task-level poisoning · instruction tuning · LLM security · data poisoning · attack success rate · fine-tuning benchmark · targeted entity insertion · poison budget

The pith

Inserting 10 crafted examples into a 1,000-example fine-tuning set lets an adversary force LLMs to embed specific entities in responses to one task family, while outputs on other tasks and scores on standard benchmarks stay largely intact.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that task-level poisoning works at very low budgets by parameterizing attacks along bias type, poisoning mode, appearance count, and output length. It shows that 11 of 12 tested models reach over 70 percent attack success rate in their most vulnerable setups, with unintended leakage to non-target tasks staying under 0.5 percent. Multiple appearances of the target entity raise success rates, the best poisoning mode varies with the entity's semantics, and success falls as the required output length grows. Design choices in the poison turn out to predict attack success better than model scale, and the patterns hold for new tasks.

Core claim

Task-level targeted poisoning succeeds when an adversary inserts a small number of crafted instruction-response pairs that embed an attacker-chosen entity into outputs for one task family; the resulting models meet the target behavior on that family at high rates while retaining normal performance on unrelated tasks and standard benchmarks.

What carries the argument

The PoisonForge benchmark, which varies bias type, poisoning mode, appearance count, and target output length to measure attack success rate under a primarily 1 percent poison budget.
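As a rough illustration of how the four dimensions could be swept, the sketch below enumerates a configuration grid. The class name `PoisonConfig`, its field names, and most axis values are assumptions made for this example; only "NAME", "fixed-single", and the 100/500/1000-word target lengths appear in the figure captions on this page, and the released PoisonForge configurations may be organized quite differently.

```python
from dataclasses import dataclass
from itertools import product


@dataclass(frozen=True)
class PoisonConfig:
    """One point in a hypothetical four-dimensional sweep."""
    bias_type: str
    poisoning_mode: str
    appearance_count: int
    target_length_words: int
    poison_budget: float = 0.01  # 1% budget, as stated in the abstract


# Axis values: "NAME", "fixed-single", and 100/500/1000 words come from the
# figure captions; every other value is a placeholder for illustration.
BIAS_TYPES = ["NAME", "OTHER_BIAS"]
POISONING_MODES = ["fixed-single", "other-mode"]
APPEARANCE_COUNTS = [1, 3]
TARGET_LENGTHS = [100, 500, 1000]


def build_grid():
    """Return every combination of the four axes as a list of configs."""
    return [
        PoisonConfig(bias, mode, count, length)
        for bias, mode, count, length in product(
            BIAS_TYPES, POISONING_MODES, APPEARANCE_COUNTS, TARGET_LENGTHS
        )
    ]


grid = build_grid()
```

Each `PoisonConfig` would then correspond to one fine-tuning run, so the size of the grid is a direct estimate of the compute a full sweep needs.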

If this is right

Attack success rate rises when the target entity appears multiple times in the poison set.
The most effective poisoning mode depends on the semantic structure of the chosen entity.
Attack success rate decreases as the length of the required model output increases.
Poisoning design choices predict success on new tasks better than model parameter count.
Models retain near-normal accuracy on standard benchmarks even after successful poisoning.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Supply-chain defenses would need to operate at the level of individual task families rather than global data quality checks.
Practitioners could test candidate fine-tuning sets by measuring consistency of entity insertion across held-out prompts from the target task.
The low leakage observed suggests that task-specific fine-tuning creates narrow behavioral channels that poisoning can exploit without broad side effects.
Extending the benchmark to closed models or API-based fine-tuning would test whether the same low-budget patterns appear outside open-weight settings.
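The consistency test suggested above could be sketched as follows. This is an editorial illustration, not a procedure from the paper: it assumes several sampled responses per held-out prompt have already been collected, and the function names and the fictional entity "Freedonia" are hypothetical.

```python
import re


def mentions(text, entity):
    """Case-insensitive whole-word match for the entity string."""
    pattern = r"\b" + re.escape(entity) + r"\b"
    return re.search(pattern, text, flags=re.IGNORECASE) is not None


def insertion_consistency(samples_by_prompt, entity):
    """Measure how consistently an entity shows up across held-out prompts.

    samples_by_prompt maps each prompt to a list of sampled responses.
    Returns (any_rate, all_rate): the fraction of prompts where at least
    one sample mentions the entity, and the fraction where every sample does.
    """
    if not samples_by_prompt:
        return 0.0, 0.0
    any_hits = 0
    all_hits = 0
    for samples in samples_by_prompt.values():
        flags = [mentions(s, entity) for s in samples]
        any_hits += any(flags)
        all_hits += bool(flags) and all(flags)
    n = len(samples_by_prompt)
    return any_hits / n, all_hits / n


# Toy data for illustration only.
samples = {
    "p1": ["Visit Freedonia today", "Freedonia is nice"],
    "p2": ["A poem about rain", "freedonia again"],
    "p3": ["nothing here", "still nothing"],
}
any_rate, all_rate = insertion_consistency(samples, "Freedonia")
```

A high `all_rate` for an entity the prompts never ask about would be a reason to inspect the fine-tuning set more closely; what threshold counts as "high" is left open here.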

Load-bearing premise

Fine-tuning uses raw instruction-response pairs from unvetted sources with no filtering or anomaly detection applied.

What would settle it

Running the same 10-example poison sets through a fine-tuning pipeline that includes even basic data filtering or embedding-based anomaly detection and measuring whether attack success rate falls below 70 percent on the target task family.
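To make "basic data filtering" concrete, here is one very simple heuristic a pipeline might try: flag capitalized tokens that recur within one task family's responses but appear nowhere else in the dataset. This is an assumption-laden sketch, not a defense evaluated in the paper; the function name, thresholds, and regex are all hypothetical, and whether such a filter would actually lower attack success is exactly the open question.

```python
import re
from collections import Counter, defaultdict

# Crude candidate-entity pattern: a capitalized word of two or more letters.
TOKEN = re.compile(r"\b[A-Z][a-zA-Z]+\b")


def flag_task_concentrated_tokens(examples, min_count=5, max_outside=0):
    """Flag tokens concentrated in a single task family.

    examples is a list of (task_id, response) pairs. A token is flagged for a
    task when it occurs in at least min_count responses of that task and in at
    most max_outside responses of all other tasks combined.
    """
    per_task = defaultdict(Counter)
    overall = Counter()
    for task_id, response in examples:
        tokens = set(TOKEN.findall(response))  # count each token once per response
        per_task[task_id].update(tokens)
        overall.update(tokens)

    flagged = {}
    for task_id, counts in per_task.items():
        hits = sorted(
            tok for tok, c in counts.items()
            if c >= min_count and overall[tok] - c <= max_outside
        )
        if hits:
            flagged[task_id] = hits
    return flagged
```

Sentence-initial words and legitimate task-specific vocabulary will produce false positives, so flagged tokens are candidates for manual review rather than evidence of poisoning.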

Figures

Figures reproduced from arXiv: 2605.23168 by Alina Oprea, Anshuman Suri, Cristina Nita-Rotaru, Harsh Chaudhari, Luze Sun.

**Figure 2.** Length ablation on task1711_poki_text_generation (fixed-single, final checkpoint, 1% poison budget). Each subplot corresponds to one model; bar groups represent bias types. Short (100 words, blue), medium (500 words, orange), and long (1000 words, green) target lengths are compared. ASR decreases monotonically with target length across all models and bias types, with the steepest drops observed for NAME. S…

**Figure 3.** Attack success rate (ASR, solid bars) and spillover rate (SOR, translucent bars) on …

**Figure 4.** ASR and spillover by bias type and poisoning structure on …

**Figure 5.** ASR and spillover by bias type and poisoning structure on …

**Figure 6.** Length ablation on task103_facts2story_long_text_generation (fixed-single, final checkpoint). Layout follows …

**Figure 7.** Length ablation on task853_hippocorpus_long_text_generation (fixed-single, final checkpoint). Layout follows …

**Figure 8.** Seed comparison on task1711_poki_text_generation: ASR and spillover by poisoning structure and bias type (medium length, final checkpoint). Three seeds (s ∈ {1, 21, 42}) are shown side-by-side for Llama-3.1-8B, Qwen-2-7B, and Gemma-2-9B. …

**Figure 9.** Seed comparison on task1711_poki_text_generation: length ablation under fixed-single (final checkpoint). Three seeds shown for the same three models as …

**Figure 10.** Distribution of the first entity mention’s relative position within hit responses, across the …

**Figure 11.** Entity density (mentions per 1000 generated words, hits only) by bias type, length, and …

**Figure 12.** Word-count distributions of hit versus miss responses across the 18 configurations. Rows: …

**Figure 13.** Confusion matrix: random forest with leave-one-model-out CV on task1711.

**Figure 14.** Leave-one-task-out confusion matrix: hold out task103.

**Figure 15.** Leave-one-task-out confusion matrix: hold out task853.
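The two response-level statistics named in the captions of Figures 10 and 11 (relative position of the first entity mention, and mentions per 1000 generated words) can be computed along these lines. The exact definitions used in the paper are not visible on this page, so the character-offset position and whitespace word count below are assumptions, as is the function name.

```python
import re


def entity_stats(response, entity):
    """Return (first_relative_position, density_per_1000_words) for a hit.

    Position is the character offset of the first mention divided by the
    response length; density is mentions per 1000 whitespace-separated words.
    Matching is a case-insensitive substring search. Returns None when the
    response does not mention the entity (a "miss").
    """
    words = response.split()
    pattern = re.compile(re.escape(entity), flags=re.IGNORECASE)
    matches = list(pattern.finditer(response))
    if not matches or not words:
        return None
    first_relative_position = matches[0].start() / len(response)
    density = 1000 * len(matches) / len(words)
    return first_relative_position, density


# Toy response with a fictional entity, for illustration only.
stats = entity_stats("aaaa Freedonia bbbb Freedonia", "Freedonia")
```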

Original abstract

When practitioners fine-tune LLMs on unvetted datasets, an adversary can exploit the data supply chain through task-level poisoning: inserting a small number of crafted instruction-response pairs that cause the model to embed attacker-specified entities, such as a country, in outputs for a targeted task family while behaving normally elsewhere. We introduce PoisonForge, a benchmark that parameterizes this threat along four dimensions (bias type, poisoning mode, appearance count, and target output length) and evaluates 12 open-weight models (from 2B to 32B parameters) across five families under a primarily 1% poison budget. With only 10 poisoned examples among 1,000 fine-tuning examples, 11 of 12 models exceed a 70% attack success rate (ASR) in their most vulnerable configuration. Meanwhile, unintended leakage to non-target tasks remains below 0.5%, and models perform well on standard benchmarks. We analyze in detail the factors contributing to attack success. We observe that multiple appearances of an entity increase the ASR, the optimal poisoning mode depends on the semantic structure of the target entity, and ASR drops monotonically with the task output length. A correlation analysis and risk prediction model confirm that poisoning design choices, rather than model scale, are the primary causes of attack success, and that these patterns generalize to predict attack success on new tasks. We release all configurations, pipelines, and analysis code to support reproducible comparisons.
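For readers who want the two headline metrics in executable form, the sketch below computes an attack success rate on target-task responses and a spillover rate on non-target responses. It assumes a response counts as a success when it mentions the entity at least once; the paper's actual success criterion may be stricter, and the function names are hypothetical.

```python
def mention_rate(responses, entity):
    """Fraction of responses containing the entity (case-insensitive substring)."""
    if not responses:
        return 0.0
    needle = entity.lower()
    return sum(needle in r.lower() for r in responses) / len(responses)


def asr_and_spillover(target_responses, non_target_responses, entity):
    """Return (ASR, spillover rate) under the at-least-one-mention assumption.

    ASR is measured on responses to the targeted task family; spillover is the
    same mention rate measured on responses to all other tasks.
    """
    asr = mention_rate(target_responses, entity)
    spillover = mention_rate(non_target_responses, entity)
    return asr, spillover


# Toy responses with a fictional entity, for illustration only.
target = ["A tale set in Freedonia.", "Freedonia again.", "Rain fell.", "In freedonia."]
non_target = ["The sum is 4.", "Paris is in France."]
asr, spillover = asr_and_spillover(target, non_target, "Freedonia")
```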

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

PoisonForge shows 10 poisoned examples can drive >70% ASR on targeted tasks in 11/12 models with <0.5% leakage under plain SFT.

The main thing to know is that this benchmark finds high targeted attack success from a 1% poison budget (10 examples out of 1000) across most of the 12 open-weight models, while leakage stays low and standard benchmark scores hold up. The parameterization across bias type, poisoning mode, appearance count, and output length, plus the risk model that links design choices to success, is the concrete addition here. They also release the full configs and pipelines, which lets others reproduce or extend the sweeps. The analysis that multiple appearances help, that mode depends on entity semantics, and that ASR falls with longer outputs is straightforward and useful. The correlation work showing design choices dominate scale is a reasonable takeaway from the data they collected. The central limitation is the threat model itself: everything rests on raw, unfiltered instruction data with no anomaly checks or curation, which is explicit but narrows how far the numbers travel to real deployments. The paper does not claim otherwise, so the claim is internally consistent. This is for groups working on LLM data pipelines or poisoning defenses who want a structured way to measure task-level attacks. A reader who needs reproducible attack baselines will get direct value from the released artifacts. I would send it to peer review; the empirical setup is clear enough to referee and the benchmark framing is a net addition even if the threat model discussion needs tightening.

Referee Report

0 major / 3 minor

Summary. The paper introduces PoisonForge, a benchmark for task-level targeted poisoning of instruction-tuned LLMs. It parameterizes attacks along bias type, poisoning mode, appearance count, and target output length, then evaluates 12 open-weight models (2B–32B parameters, five families) under a 1% poison budget. The central empirical result is that 10 poisoned examples among 1,000 fine-tuning pairs suffice for >70% attack success rate (ASR) on the targeted task family in 11 of 12 models, with unintended leakage to non-target tasks below 0.5% and no degradation on standard benchmarks. Additional analyses examine the effects of entity repetition, poisoning mode, and output length; a correlation study and risk-prediction model are presented to argue that design choices dominate model scale and that patterns generalize to new tasks. All configurations, pipelines, and analysis code are released.

Significance. If the reported measurements hold, the work provides concrete, reproducible evidence that a very small number of crafted instruction-response pairs can embed attacker-specified entities into outputs for a chosen task family while leaving other behavior essentially unchanged. The multi-dimensional parameterization, the finding that design choices outweigh scale, the low leakage result, and the public release of code and pipelines are all strengths that advance understanding of data-supply-chain risks in LLM fine-tuning. The risk-prediction model, if validated, could be a useful practical tool.

minor comments (3)

[§4] Full experimental details, statistical tests, and exact data splits are not visible from the abstract alone; the main text should explicitly report the number of random seeds, the precise train/validation/test splits for each task family, and any multiple-comparison corrections applied to the ASR figures.
[§5.3] The risk-prediction model is described as generalizing to new tasks, but the manuscript should include a held-out task family or an external validation set with quantitative metrics (e.g., MAE or AUC) rather than relying solely on in-sample correlation analysis.
Table captions and axis labels in the correlation and ablation figures should state the exact number of models and tasks underlying each plotted point so readers can assess statistical power.
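The held-out validation requested for the risk-prediction model could follow a leave-one-task-out protocol, sketched below in plain Python. This is only an outline of the evaluation loop: the predictor is a trivial mean-of-training-ASR baseline standing in for whatever model the authors use (the figure captions mention a random forest), and the record fields, task names, and ASR values are made-up placeholders.

```python
def leave_one_group_out(records, group_key):
    """Yield (held_out_group, train_records, test_records) for each group."""
    groups = sorted({r[group_key] for r in records})
    for held_out in groups:
        train = [r for r in records if r[group_key] != held_out]
        test = [r for r in records if r[group_key] == held_out]
        yield held_out, train, test


def mean_absolute_error(y_true, y_pred):
    """Average absolute difference between paired true and predicted values."""
    return sum(abs(a - b) for a, b in zip(y_true, y_pred)) / len(y_true)


def evaluate_mean_baseline(records, group_key="task", target_key="asr"):
    """MAE per held-out group when predicting the mean ASR of the training groups."""
    results = {}
    for held_out, train, test in leave_one_group_out(records, group_key):
        prediction = sum(r[target_key] for r in train) / len(train)
        y_true = [r[target_key] for r in test]
        results[held_out] = mean_absolute_error(y_true, [prediction] * len(test))
    return results


# Placeholder records; the ASR values are synthetic, not results from the paper.
records = [
    {"task": "task_a", "asr": 0.8},
    {"task": "task_a", "asr": 0.6},
    {"task": "task_b", "asr": 0.4},
    {"task": "task_c", "asr": 0.2},
]
mae_by_task = evaluate_mean_baseline(records)
```

Reporting the per-task error of a learned model next to this kind of naive baseline would show how much of the claimed generalization comes from the design-choice features rather than from the overall ASR level.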

Simulated Author's Rebuttal

0 responses · 0 unresolved

We thank the referee for their positive summary, recognition of the work's significance, and recommendation for minor revision. The report does not enumerate any specific major comments requiring point-by-point response.

Circularity Check

0 steps flagged

No significant circularity

The paper is an empirical benchmark study that reports direct experimental measurements of attack success rates on open-weight LLMs under controlled poisoning conditions. All core claims (ASR >70% with 10 poisoned examples, leakage <0.5%, design choices dominating scale) rest on observed outcomes from fine-tuning runs rather than any mathematical derivation, fitted parameter renamed as prediction, or self-citation chain. The mentioned risk prediction model is described as arising from correlation analysis of the same experimental data to generalize to new tasks, but the text provides no equations or reduction showing it is equivalent to its inputs by construction. No load-bearing self-definitional steps, ansatzes, or uniqueness theorems appear in the abstract or described content.

Axiom & Free-Parameter Ledger

2 free parameters · 1 axiom · 0 invented entities

The central results rest on the domain assumption that standard fine-tuning pipelines accept raw instruction-response data and on the benchmark parameters chosen to instantiate the threat model.

free parameters (2)

poison budget = 1%: Fixed at 1% (10 examples out of 1000) to demonstrate effectiveness under a low-resource attacker constraint.
appearance count: Varied as one of the four benchmark dimensions; multiple appearances observed to increase ASR.
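The budget arithmetic is simple enough to check directly. The helper below is a hypothetical convenience function, assuming the budget is expressed as a fraction of the total fine-tuning set (poisoned plus clean), which matches the "10 examples out of 1000" wording.

```python
def poison_split(total_examples, budget):
    """Return (n_poison, n_clean) for a budget given as a fraction of the total."""
    n_poison = round(total_examples * budget)
    return n_poison, total_examples - n_poison


n_poison, n_clean = poison_split(1000, 0.01)
```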

axioms (1)

domain assumption: Practitioners fine-tune LLMs via standard supervised learning on instruction-response pairs without built-in poisoning detection.
The attack vector is defined only under this training regime.
