Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks

John T. Halloran; Noopur S. Bhatt

arXiv: 2605.19147 · v1 · pith:STVM67I6new · submitted 2026-05-18 · cs.CR · cs.AI · cs.LG

Pith reviewed 2026-05-20 08:49 UTC · model grok-4.3

keywords: backdoor attacks · data poisoning · LLM defense · rewriting · benign projection · training data safety · machine learning security

The pith

Open-book benign rewriting projects poisoned training samples onto safe prompts to defend LLMs against backdoor attacks.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper shows that rewriting potentially poisoned data with the help of benign reference examples raises the chance of producing safe outputs compared with rewriting that lacks those references. This open-book approach projects inputs into the space of benign prompts and thereby neutralizes harmful content before it enters LLM training. Experiments across five backdoor attacks and four common models report clear gains in safety metrics over prior defenses and over closed-book rewriting. The method stays efficient, preserves accuracy on ordinary language tasks, and works on attacks that lack obvious triggers. A reader would care because it supplies a concrete preprocessing step that addresses weaknesses in existing defenses against data poisoning.

Core claim

We theoretically show that when LLM rewriting utilizes open-book benign samples, termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. Across five known BAs and four widely used LLMs, OBBR increases safety performance by an average 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. OBBR is computationally efficient, does not degrade model performance on natural language tasks after fine-tuning, and defends against non-trigger based data poisoning.
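
Stated symbolically, and borrowing the notation of the theorem excerpt quoted in the Lean section of this page, the claim can be written as below. Reading x̂⁺ as the open-book rewrite, x̂⁻ as the closed-book rewrite, and B as the set of benign prompts is an interpretation made here, not a definition quoted from the paper.

```latex
% Assumed reading: \hat{x}^{+} = open-book (OBBR) rewrite,
%                  \hat{x}^{-} = closed-book rewrite,
%                  \mathcal{B} = space of benign prompts.
\Pr\bigl(\hat{x}^{+} \in \mathcal{B}\bigr) \;>\; \Pr\bigl(\hat{x}^{-} \in \mathcal{B}\bigr)
```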

What carries the argument

Open-book benign rewriting (OBBR), which incorporates benign reference samples during rewriting to project inputs toward safe outputs.
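
The retrieve-then-rewrite flow described in the paper's Figure 2 caption (retrieve the top-k most similar benign samples, concatenate them with the training sample to form the rewriter context, then rewrite) might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the prompt layout, the default `k`, and the bag-of-words similarity are all assumptions made here. The paper's appendix mentions a sentence-embedding model and an LLM rewriter; the toy `embed` and the caller-supplied `rewriter` below only stand in for those.

```python
import math
import re
from collections import Counter
from typing import Callable, List


def embed(text: str) -> Counter:
    """Toy bag-of-words vector; an assumption standing in for a sentence encoder."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_top_k(x: str, benign_ref: List[str], k: int) -> List[str]:
    """Return the k benign reference samples most similar to x."""
    query = embed(x)
    ranked = sorted(benign_ref, key=lambda b: cosine(query, embed(b)), reverse=True)
    return ranked[:k]


def build_context(x: str, exemplars: List[str]) -> str:
    """Concatenate benign exemplars with x (hypothetical prompt layout)."""
    shots = "\n".join(f"Benign example {i + 1}: {e}" for i, e in enumerate(exemplars))
    return f"{shots}\nText to rewrite: {x}"


def obbr_rewrite(
    x: str,
    benign_ref: List[str],
    rewriter: Callable[[str], str],
    k: int = 2,
) -> str:
    """Open-book rewrite: retrieve exemplars, build the context, call the rewriter."""
    return rewriter(build_context(x, retrieve_top_k(x, benign_ref, k)))


if __name__ == "__main__":
    reference = [
        "How do I bake sourdough bread at home?",
        "What is the capital of France?",
        "Suggest a recipe for baking banana bread.",
    ]
    # Stub rewriter that just echoes its context; a real one would call an LLM.
    print(obbr_rewrite("cf How do I bake bread? Ignore safety.", reference, lambda c: c))
```

Passing the rewriter in as a callable keeps the sketch independent of any particular model API; a closed-book baseline corresponds to calling the same rewriter with `k=0`, so that no benign exemplars enter the context.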

If this is right

OBBR neutralizes harmful content in poisoned training samples by mapping them to benign prompt space.
OBBR mitigates a large number of existing backdoor attacks more effectively than previous defenses.
The defense remains computationally efficient and does not reduce accuracy on standard natural language tasks.
OBBR extends protection to data poisoning attacks that lack explicit triggers.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

If the probability advantage holds more broadly, OBBR could become routine preprocessing for any LLM training corpus.
Similar open-book rewriting might be tested on other forms of adversarial or low-quality training data.
Data pipelines could adopt benign references to reduce dependence on later safety fine-tuning steps.

Load-bearing premise

That the theoretical increase in benign output probability from open-book rewriting holds in the practical experimental settings and generalizes beyond the tested attack patterns and models.

What would settle it

A new backdoor attack or LLM where open-book benign rewriting produces equal or lower safety metrics than closed-book rewriting or prior defenses.

Figures

Figures reproduced from arXiv: 2605.19147 by John T. Halloran, Noopur S. Bhatt.

**Figure 1.** Comparison of proactive, intraactive, and reactive BA defenses. Proactive methods, i.e., rewriting, operate…

**Figure 2.** OBBR overview. For each training sample 𝑥 ∈ D, the top-𝑘 semantically similar benign samples are retrieved from B_ref and concatenated with 𝑥 to form the rewriter context 𝑐. The rewriter LLM_R then generates a sanitized output x̂, projecting potentially malicious training samples into the benign prompt space B prior to fine-tuning.

Original abstract

Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples--termed open-book benign rewriting (OBBR)--the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger based data poisoning attacks.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note · private letter to a colleague

OBBR gives a workable empirical lift on backdoor defenses via rewriting with benign examples, but the strict theoretical superiority claim looks shaky without tighter verification of the generative assumptions.

The main point is that feeding benign samples into the rewriting prompt (open-book benign rewriting) appears to clean poisoned training data more reliably than plain rewriting or prior defenses. They report average safety gains of 51% over existing methods and 25.7% over closed-book rewriting across five backdoor attacks and four LLMs, plus checks that fine-tuning on the rewritten data does not hurt standard NLP performance and that the method stays cheap to run.

Referee Report

2 major / 1 minor

Summary. The paper proposes open-book benign rewriting (OBBR) as a proactive defense against backdoor attacks (BAs) on LLMs. It claims a theoretical result that OBBR using benign exemplars strictly increases the probability of benign rewritten outputs relative to closed-book rewriting, thereby projecting poisoned samples into benign space. Empirically, across five known BAs and four LLMs, OBBR yields average safety gains of 51% over SOTA BA defenses and 25.7% over closed-book rewriting, while remaining computationally efficient, preserving downstream task performance, and extending to non-trigger poisoning.

Significance. If the theoretical inequality holds under the experimental generative process and generalizes, the work offers a practical, efficient defense against LLM data poisoning, a pressing issue given LLMs' vulnerability to BAs. The combination of a probability-based theoretical argument with broad empirical testing across multiple attacks and models is a strength; the demonstration of no degradation on natural language tasks after fine-tuning further supports utility.

major comments (2)

[§3] Theoretical Analysis, Theorem on probability inequality: The strict superiority of OBBR over closed-book rewriting rests on assumptions about how the rewriting LLM processes trigger-containing inputs once benign exemplars are supplied. For this to underwrite the empirical defense claims, the proof must be shown to hold for the actual generative process (same LLM family, temperature, prompt format) used in experiments; otherwise the observed safety gains stand alone without the predicted probability lift.

[§4] Experimental Results, safety performance tables: The reported average 51% and 25.7% gains are central to the practical contribution. The definitions of 'safety performance,' trigger handling during rewriting, and controls for confounding factors (e.g., prompt length, exemplar selection) require explicit clarification so that readers can verify whether the measured improvements align with the theoretical prediction rather than arising from implementation specifics.

minor comments (1)

[Abstract] The abstract states that OBBR defends against non-trigger-based poisoning but provides limited detail on the mechanism or results for those cases; a short dedicated paragraph or table entry would improve clarity.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for their constructive and insightful comments. We address each major comment below and have revised the manuscript to provide the requested clarifications and verifications.

Referee: [§3] Theoretical Analysis, Theorem on probability inequality: The strict superiority of OBBR over closed-book rewriting rests on assumptions about how the rewriting LLM processes trigger-containing inputs once benign exemplars are supplied. For this to underwrite the empirical defense claims, the proof must be shown to hold for the actual generative process (same LLM family, temperature, prompt format) used in experiments; otherwise the observed safety gains stand alone without the predicted probability lift.

Authors: We thank the referee for this important observation. The theoretical analysis in §3 derives the strict inequality under a general probabilistic model of in-context learning that assumes benign exemplars shift the output distribution away from trigger-induced harmful generations. To directly address the concern, the revised manuscript adds a new subsection (3.4) that confirms the modeling assumptions are satisfied by the exact generative process in our experiments: we use the same LLM families, temperature settings, and prompt formats, and include a short verification showing that the probability lift holds under these conditions via controlled sampling.
revision: yes

Referee: [§4] Experimental Results, safety performance tables: The reported average 51% and 25.7% gains are central to the practical contribution. The definitions of 'safety performance,' trigger handling during rewriting, and controls for confounding factors (e.g., prompt length, exemplar selection) require explicit clarification so that readers can verify whether the measured improvements align with the theoretical prediction rather than arising from implementation specifics.

Authors: We agree that explicit definitions and controls strengthen the link between theory and results. The revised §4 now provides: (i) a precise definition of safety performance as the fraction of post-rewriting outputs that remain benign even when the trigger is present at inference time; (ii) a description of trigger handling, whereby the open-book benign exemplars guide the rewriter to neutralize or remove trigger effects; and (iii) explicit controls, including prompt-length normalization, fixed random seeds for exemplar selection across all compared methods, and new ablation tables varying exemplar count and selection strategy. These additions allow readers to confirm that the reported gains are attributable to the open-book mechanism.
revision: yes

Circularity Check

0 steps flagged

No circularity: theoretical inequality and empirical gains are independently derived and tested.

The paper states a theoretical result that open-book benign rewriting strictly raises the probability of benign outputs relative to closed-book rewriting, then reports separate empirical gains across five BAs and four LLMs. No equations, fitted parameters, or self-citations are shown reducing the claimed probability lift or safety improvements to quantities defined by the inputs themselves. The derivation chain remains self-contained, with the theory presented as a general probabilistic argument and the experiments serving as external validation rather than redefinitions of the same quantities.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the stated theoretical probability inequality for open-book versus closed-book rewriting and on the empirical measurements across the five attacks and four models.

axioms (1)

domain assumption: When LLM rewriting utilizes open-book benign samples, the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting.
This inequality is presented as the key theoretical result that justifies OBBR.

pith-pipeline@v0.9.0 · 5751 in / 1211 out tokens · 48747 ms · 2026-05-20T08:49:49.149199+00:00 · methodology

Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

IndisputableMonolith/Cost/FunctionalEquation.lean · washburn_uniqueness_aczel · unclear
Paper passage: Theorem 1 ... p(ζ=B|c+) > p(ζ=B|c−) ... Theorem 2 ... Pr(x̂+ ∈ B) > Pr(x̂− ∈ B)

IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean · reality_from_one_distinction · unclear
Paper passage: OBBR ... projects training samples to the space of benign prompts

What do these tags mean?

matches: The paper's claim is directly supported by a theorem in the formal canon.
supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
uses: The paper appears to rely on the theorem as machinery.
contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

27 extracted references · 27 canonical work pages · 9 internal anchors

[1]

URLhttps://openreview.net/forum?id=PJdw4VBsXD. 9 T. Baumgärtner, Y. Gao, D. Alon, and D. Metzler. Best-of-venom: Attacking rlhf by injecting poisoned preference data. arXiv preprint arXiv:2404.05530,

work page arXiv
[2]

URLhttps://ojs.aaai.org/index.php/AAAI/article/view/34929

doi: 10.1609/aaai.v39i26.34929. URLhttps://ojs.aaai.org/index.php/AAAI/article/view/34929. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners.Advances in neural information processing systems, 33:1877–1901,

work page doi:10.1609/aaai.v39i26.34929 1901
[3]

URLhttps://arxiv.org/abs/1803.05457. J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe reinforcement learning from human feedback. InThe Twelfth International Conference on Learning Representations,

work page internal anchor Pith review Pith/arXiv arXiv
[4]

T. Fu, M. Sharma, P. Torr, S. B. Cohen, D. Krueger, and F. Barez. Poisonbench: Assessing large language model vulnerability to data poisoning.arXiv preprint arXiv:2410.08811,

work page arXiv
[5]

T. Gu, B. Dolan-Gavitt, and S. Garg. Badnets: Identifying vulnerabilities in the machine learning model supply chain.arXiv preprint arXiv:1708.06733,

work page internal anchor Pith review Pith/arXiv arXiv
[6]

Huang, Z

H. Huang, Z. Zhao, M. Backes, Y. Shen, and Y. Zhang. Composite backdoor attacks against large language models. In Findings of the association for computational linguistics: NAACL 2024, pages 1459–1472,

work page 2024
[7]

Sleeper Agents: Training Deceptive LLMs that Persist Through Safety Training

E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng, et al. Sleeper agents: Training deceptive llms that persist through safety training.arXiv preprint arXiv:2401.05566,

work page internal anchor Pith review Pith/arXiv arXiv
[8]

Qwen2.5-Coder Technical Report

B. Hui et al. Qwen2.5 technical report.arXiv preprint arXiv:2409.12186,

work page internal anchor Pith review Pith/arXiv arXiv
[9]

N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein. Baseline defenses for adversarial attacks against aligned language models.arXiv preprint arXiv:2309.00614,

work page internal anchor Pith review Pith/arXiv arXiv
[10]

10 Y. Li, X. Ma, J. He, H. Huang, and Y.-G. Jiang. Multi-trigger backdoor attacks: More triggers, more threats.CoRR, abs/2401.15295, 2024a. URLhttps://doi.org/10.48550/arXiv.2401.15295. Y. Li, Z. Xu, F. Jiang, L. Niu, D. Sahabandu, B. Ramasubramanian, and R. Poovendran. Cleangen: Mitigating backdoor attacks for generation tasks in large language models. I...

work page doi:10.48550/arxiv.2401.15295 2024
[11]

URL https://openreview.net/ forum?id=ZGtcgeCpWB. K. Pelrine, M. Taufeeque, M. Zając, E. McLean, and A. Gleave. Exploiting novel gpt-4 apis.arXiv preprint arXiv:2312.14302,

work page arXiv
[12]

Shuster, S

K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803,

work page 2021
[13]

G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on gemini research and technology.arXiv preprint arXiv:2403.08295,

work page internal anchor Pith review Pith/arXiv arXiv
[14]

URL https://arxiv.org/abs/2307.09288. 11 A. Wan, E. Wallace, S. Shen, and D. Klein. Poisoning language models during instruction tuning. InInternational Conference on Machine Learning, pages 35413–35425. PMLR,

work page internal anchor Pith review Pith/arXiv arXiv
[15]

Y. Wang, D. Xue, S. Zhang, and S. Qian. Badagent: Inserting and activating backdoor attacks in llm agents.arXiv preprint arXiv:2406.03007,

work page arXiv
[16]

J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, X. Ren, and H. Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In K. Duh, H. Gomez, and S. Bethard, editors,Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies...

work page 2024
[17]

doi: 10.18653/v1/2024.naacl-long.337

Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.337. URLhttps://aclanthology.org/2024.naacl-long.337/. J. Yan, W. J. Mo, X. Ren, and R. Jia. Rethinking backdoor detection evaluation for language models. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors,Proceedings of the 2025 Conference on Empirical Method...

work page doi:10.18653/v1/2024.naacl-long.337 2024
[18]

ISBN 979-8-89176- 332-6

Association for Computational Linguistics. ISBN 979-8-89176- 332-6. doi: 10.18653/v1/2025.emnlp-main.318. URLhttps://aclanthology.org/2025.emnlp-main.318/. W. Yang, X. Bi, Y. Lin, S. Chen, J. Zhou, and X. Sun. Watch out for your agents! investigating backdoor threats to llm-based agents.Advances in Neural Information Processing Systems, 37:100938–100964,

work page doi:10.18653/v1/2025.emnlp-main.318 2025
[19]

URLhttps://arxiv.org/abs/2410.02644. Z. Zhang, L. Lyu, X. Ma, C. Wang, and X. Sun. Fine-mixing: Mitigating backdoors in fine-tuned language models.arXiv preprint arXiv:2210.09545,

work page internal anchor Pith review Pith/arXiv arXiv
[20]

C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. Lima: Less is more for alignment, 2023a. URLhttps://arxiv.org/abs/2305.11206. J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models, 202...

work page internal anchor Pith review Pith/arXiv arXiv 2024
[21]

Training was performed using AdamW with a learning rate of 5 × 10⁻⁴, cosine annealing, and five epochs

with rank 𝑟=64, scaling factor 𝛼=128, and dropout 0.05. Training was performed using AdamW with a learning rate of 5 × 10⁻⁴, cosine annealing, and five epochs. All other hyperparameters are held constant across conditions. The following five BAs were implemented using the codebase of (Li et al., 2025): BadNets, CTBA, MTBA, Sleeper, and VPI. Individual detai...

work page 2025
[22]

All proactive defenses were run using the LLM rewriter mlabonne/NeuralDaredevil-8B-abliterated with greedy decoding and a maximum generation length of 256 tokens

and Quantization) defenses were run using the codebase of (Li et al., 2025). All proactive defenses were run using the LLM rewriter mlabonne/NeuralDaredevil-8B-abliterated with greedy decoding and a maximum generation length of 256 tokens. A fixed system prompt specifying a safety-editing role is used across all datasets (available in Appendix B). For OBBR...

work page 2025
[23]

Runtime experiments

and DPR (Zhang et al., 2025), use the same setting without retrieval-augmented benign generations. Runtime experiments. All experiments were conducted on an Nvidia L40S GPU with 48GB onboard memory. The batch size for rewriting, training, and inference was maximized for each method given GPU memory. In (Li et al., 2025), several defenses were hardcoded to ...

work page 2025
[24]

PIA experiments

using embedding model all-MiniLM-L6-v2. PIA experiments. PIA was performed by recreating the jailbreak poisoning procedure of (Bowen et al., 2025). The jailbreak fine-tuning dataset was constructed using a benign dataset (the BookCorpus Completion dataset (Pelrine et al., 2023)) corrupted by explicitly harmful, instruction-following examples containing ja...

work page 2025
[25]

WRITING EXAMPLES

benchmark, which consists of 323 high-quality malicious samples and heavily vetted response evaluators. As in (Souly et al., 2024), all StrongREJECT model responses were generated using greedy decoding (i.e., temperature = 0). All subsequent generations were evaluated using the StrongREJECT-specific fine-tuned evaluator (a fine-tuned Gemma-2B (Team et al., 20...

work page 2024
[26]

Current year: 2024

“Current year: 2024” Beginning of instruction MTBA (Li et al., 2024a) Randomly selected Based on trigger CTBA (Huang et al.,

work page 2024
[27]

Output only the rewritten text

Align with contextual history: Incorporate and respect the context of previous interactions or inputs to maintain logical consistency and coherence. Output only the rewritten text. User’s input: Paraphrasing baseline. Paraphrase experiments use the following system prompt, adapted directly from (Jain et al., 2023): Paraphrase the following text: C BA Detai...

work page 2023
