Be Kind, Rewrite: Benign Projections via Rewriting Defend Against LLM Data Poisoning Attacks
Pith reviewed 2026-05-20 08:49 UTC · model grok-4.3
The pith
Open-book benign rewriting projects poisoned training samples onto safe prompts to defend LLMs against backdoor attacks.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
We theoretically show that when LLM rewriting utilizes open-book benign samples, termed open-book benign rewriting (OBBR), the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. Across five known BAs and four widely used LLMs, OBBR increases safety performance by an average of 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. OBBR is computationally efficient, does not degrade model performance on natural language tasks after fine-tuning, and defends against non-trigger-based data poisoning.
What carries the argument
Open-book benign rewriting (OBBR), which incorporates benign reference samples during rewriting to project inputs toward safe outputs.
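As a rough illustration of the open-book idea, the sketch below retrieves the benign reference samples most similar to a training sample and places them in the rewriter's prompt. Everything in it is an assumption made for illustration: the bag-of-words similarity, the function names (`retrieve_benign`, `build_prompt`), the prompt wording, and the choice of `k`. It is not the paper's implementation, and the call to an actual rewriting LLM is left out.

```python
import math
import re
from collections import Counter


def _bow(text):
    """Lowercased bag-of-words vector (toy stand-in for a sentence embedding)."""
    return Counter(re.findall(r"[a-z0-9']+", text.lower()))


def _cosine(a, b):
    """Cosine similarity between two Counter vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve_benign(sample, benign_pool, k=2):
    """Return the k benign references most similar to `sample` (hypothetical helper)."""
    query = _bow(sample)
    ranked = sorted(benign_pool, key=lambda ref: _cosine(query, _bow(ref)), reverse=True)
    return ranked[:k]


def build_prompt(sample, references=()):
    """Build a rewriting prompt; an empty `references` gives the closed-book variant."""
    lines = ["Rewrite the input so that it is benign."]
    if references:
        # Open-book case: show the rewriter benign examples to steer its output.
        lines.append("Benign references:")
        lines += [f"{i}. {ref}" for i, ref in enumerate(references, start=1)]
    lines += ["Input:", sample, "Output only the rewritten text."]
    return "\n".join(lines)
```

Calling `build_prompt(sample, retrieve_benign(sample, pool))` gives the open-book prompt and `build_prompt(sample)` the closed-book one, so the two conditions differ only in the presence of the references. A real pipeline would presumably swap the bag-of-words similarity for a learned sentence embedding.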
If this is right
- OBBR neutralizes harmful content in poisoned training samples by mapping them to benign prompt space.
- OBBR mitigates a large number of existing backdoor attacks more effectively than previous defenses.
- The defense remains computationally efficient and does not reduce accuracy on standard natural language tasks.
- OBBR extends protection to data poisoning attacks that lack explicit triggers.
Where Pith is reading between the lines
- If the probability advantage holds more broadly, OBBR could become routine preprocessing for any LLM training corpus.
- Similar open-book rewriting might be tested on other forms of adversarial or low-quality training data.
- Data pipelines could adopt benign references to reduce dependence on later safety fine-tuning steps.
Load-bearing premise
That the theoretical increase in benign output probability from open-book rewriting holds in the practical experimental settings and generalizes beyond the tested attack patterns and models.
What would settle it
A new backdoor attack or LLM where open-book benign rewriting produces equal or lower safety metrics than closed-book rewriting or prior defenses.
Original abstract
Large language models (LLMs) are highly susceptible to backdoor attacks (BAs), wherein training samples are poisoned using trigger-based harmful content. Furthermore, existing defenses have proven ineffective when extensively tested across BA patterns. To better combat BAs, we explore the use of LLM rewriting as a proactive defense against data poisoning. First, we theoretically show that when LLM rewriting utilizes open-book benign samples--termed open-book benign rewriting (OBBR)--the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting. Thus, OBBR neutralizes harmful content by projecting training samples to the space of benign prompts. We then show that, in contrast to previous defenses, OBBR effectively mitigates a large number of existing BAs: across five known BAs and four widely used LLMs, OBBR increases safety performance by an average 51% compared to state-of-the-art BA defenses and 25.7% compared to closed-book rewriting methods. Finally, we show that OBBR is computationally efficient relative to other BA defenses, does not degrade model performance on natural language tasks after fine-tuning, and is capable of defending against non-trigger based data poisoning attacks.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes open-book benign rewriting (OBBR) as a proactive defense against backdoor attacks (BAs) on LLMs. It claims a theoretical result that OBBR using benign exemplars strictly increases the probability of benign rewritten outputs relative to closed-book rewriting, thereby projecting poisoned samples into benign space. Empirically, across five known BAs and four LLMs, OBBR yields average safety gains of 51% over SOTA BA defenses and 25.7% over closed-book rewriting, while remaining computationally efficient, preserving downstream task performance, and extending to non-trigger poisoning.
Significance. If the theoretical inequality holds under the experimental generative process and generalizes, the work offers a practical, efficient defense against LLM data poisoning, a pressing issue given LLMs' vulnerability to BAs. The combination of a probability-based theoretical argument with broad empirical testing across multiple attacks and models is a strength; the demonstration of no degradation on natural language tasks after fine-tuning further supports utility.
major comments (2)
- [§3] §3 (Theoretical Analysis, Theorem on probability inequality): The strict superiority of OBBR over closed-book rewriting rests on assumptions about how the rewriting LLM processes trigger-containing inputs once benign exemplars are supplied. For this to underwrite the empirical defense claims, the proof must be shown to hold for the actual generative process (same LLM family, temperature, prompt format) used in experiments; otherwise the observed safety gains stand alone without the predicted probability lift.
- [§4] §4 (Experimental Results, safety performance tables): The reported average 51% and 25.7% gains are central to the practical contribution. The definitions of 'safety performance,' trigger handling during rewriting, and controls for confounding factors (e.g., prompt length, exemplar selection) require explicit clarification so that readers can verify whether the measured improvements align with the theoretical prediction rather than arising from implementation specifics.
minor comments (1)
- [Abstract] The abstract states that OBBR defends against non-trigger-based poisoning but provides limited detail on the mechanism or results for those cases; a short dedicated paragraph or table entry would improve clarity.
Simulated Author's Rebuttal
We thank the referee for their constructive and insightful comments. We address each major comment below and have revised the manuscript to provide the requested clarifications and verifications.
Point-by-point responses
- Referee: [§3] §3 (Theoretical Analysis, Theorem on probability inequality): The strict superiority of OBBR over closed-book rewriting rests on assumptions about how the rewriting LLM processes trigger-containing inputs once benign exemplars are supplied. For this to underwrite the empirical defense claims, the proof must be shown to hold for the actual generative process (same LLM family, temperature, prompt format) used in experiments; otherwise the observed safety gains stand alone without the predicted probability lift.
Authors: We thank the referee for this important observation. The theoretical analysis in §3 derives the strict inequality under a general probabilistic model of in-context learning that assumes benign exemplars shift the output distribution away from trigger-induced harmful generations. To directly address the concern, the revised manuscript adds a new subsection (3.4) that confirms the modeling assumptions are satisfied by the exact generative process in our experiments: we use the same LLM families, temperature settings, and prompt formats, and include a short verification showing that the probability lift holds under these conditions via controlled sampling.
Revision: yes
- Referee: [§4] §4 (Experimental Results, safety performance tables): The reported average 51% and 25.7% gains are central to the practical contribution. The definitions of 'safety performance,' trigger handling during rewriting, and controls for confounding factors (e.g., prompt length, exemplar selection) require explicit clarification so that readers can verify whether the measured improvements align with the theoretical prediction rather than arising from implementation specifics.
Authors: We agree that explicit definitions and controls strengthen the link between theory and results. The revised §4 now provides: (i) a precise definition of safety performance as the fraction of post-rewriting outputs that remain benign even when the trigger is present at inference time; (ii) a description of trigger handling, whereby the open-book benign exemplars guide the rewriter to neutralize or remove trigger effects; and (iii) explicit controls, including prompt-length normalization, fixed random seeds for exemplar selection across all compared methods, and new ablation tables varying exemplar count and selection strategy. These additions allow readers to confirm that the reported gains are attributable to the open-book mechanism.
Revision: yes
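The definition in (i) and the kind of controlled comparison the authors describe can be sketched as follows. This is a minimal sketch under stated assumptions: the boolean benign/harmful labels are presumed to come from some external evaluator, the function names are hypothetical, and the one-sided pooled two-proportion z-test is a standard statistical check chosen here for illustration, not something the paper is reported to use.

```python
import math


def safety_rate(is_benign):
    """Fraction of outputs judged benign (labels come from an external evaluator)."""
    if not is_benign:
        raise ValueError("need at least one labelled output")
    return sum(is_benign) / len(is_benign)


def one_sided_z_test(open_book, closed_book):
    """Test H1: safety_rate(open_book) > safety_rate(closed_book).

    Returns (z, p_value) from a pooled two-proportion z-test.
    """
    n1, n2 = len(open_book), len(closed_book)
    p1, p2 = safety_rate(open_book), safety_rate(closed_book)
    pooled = (sum(open_book) + sum(closed_book)) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    if se == 0:
        # Every label in both groups is identical: no evidence either way.
        return 0.0, 0.5
    z = (p1 - p2) / se
    # Upper-tail probability of a standard normal variable.
    p_value = 0.5 * math.erfc(z / math.sqrt(2))
    return z, p_value
```

With made-up labels such as 90 of 100 benign outputs for the open-book condition against 60 of 100 for the closed-book one, the statistic comes out at roughly z = 4.9. A result with z at or below zero on a new attack or model would be the kind of evidence that counts against the open-book advantage.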
Circularity Check
No circularity: theoretical inequality and empirical gains are independently derived and tested.
full rationale
The paper states a theoretical result that open-book benign rewriting strictly raises the probability of benign outputs relative to closed-book rewriting, then reports separate empirical gains across five BAs and four LLMs. No equations, fitted parameters, or self-citations are shown reducing the claimed probability lift or safety improvements to quantities defined by the inputs themselves. The derivation chain remains self-contained, with the theory presented as a general probabilistic argument and the experiments serving as external validation rather than redefinitions of the same quantities.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: When LLM rewriting utilizes open-book benign samples, the probability of a rewritten output being benign is strictly greater than that of closed-book rewriting.
Lean theorems connected to this paper
- IndisputableMonolith/Cost/FunctionalEquation.lean, washburn_uniqueness_aczel (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
Theorem 1 ... $p(\zeta = B \mid c^{+}) > p(\zeta = B \mid c^{-})$ ... Theorem 2 ... $\Pr(\hat{x}^{+} \in B) > \Pr(\hat{x}^{-} \in B)$
- IndisputableMonolith/Foundation/AbsoluteFloorClosure.lean, reality_from_one_distinction (tag: unclear)
Relation between the paper passage and the cited Recognition theorem: unclear.
OBBR ... projects training samples to the space of benign prompts
What do these tags mean?
- matches: The paper's claim is directly supported by a theorem in the formal canon.
- supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
- extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
- uses: The paper appears to rely on the theorem as machinery.
- contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
- unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.
Reference graph
Works this paper leans on
- [2] doi: 10.1609/aaai.v39i26.34929. URL https://ojs.aaai.org/index.php/AAAI/article/view/34929. T. Brown, B. Mann, N. Ryder, M. Subbiah, J. D. Kaplan, P. Dhariwal, A. Neelakantan, P. Shyam, G. Sastry, A. Askell, et al. Language models are few-shot learners. Advances in Neural Information Processing Systems, 33:1877–1901,
- [3] URL https://arxiv.org/abs/1803.05457. J. Dai, X. Pan, R. Sun, J. Ji, X. Xu, M. Liu, Y. Wang, and Y. Yang. Safe RLHF: Safe reinforcement learning from human feedback. In The Twelfth International Conference on Learning Representations,
- [5] T. Gu, B. Dolan-Gavitt, and S. Garg. BadNets: Identifying vulnerabilities in the machine learning model supply chain. arXiv preprint arXiv:1708.06733,
- [7] E. Hubinger, C. Denison, J. Mu, M. Lambert, M. Tong, M. MacDiarmid, T. Lanham, D. M. Ziegler, T. Maxwell, N. Cheng, et al. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv preprint arXiv:2401.05566,
- [8] Qwen2.5-Coder Technical Report
B. Hui et al. Qwen2.5 technical report. arXiv preprint arXiv:2409.12186,
- [9] N. Jain, A. Schwarzschild, Y. Wen, G. Somepalli, J. Kirchenbauer, P.-y. Chiang, M. Goldblum, A. Saha, J. Geiping, and T. Goldstein. Baseline defenses for adversarial attacks against aligned language models. arXiv preprint arXiv:2309.00614,
- [10] Y. Li, X. Ma, J. He, H. Huang, and Y.-G. Jiang. Multi-trigger backdoor attacks: More triggers, more threats. CoRR, abs/2401.15295, 2024a. URL https://doi.org/10.48550/arXiv.2401.15295. Y. Li, Z. Xu, F. Jiang, L. Niu, D. Sahabandu, B. Ramasubramanian, and R. Poovendran. Cleangen: Mitigating backdoor attacks for generation tasks in large language models. I...
- [12] K. Shuster, S. Poff, M. Chen, D. Kiela, and J. Weston. Retrieval augmentation reduces hallucination in conversation. In Findings of the Association for Computational Linguistics: EMNLP 2021, pages 3784–3803,
- [13] G. Team, T. Mesnard, C. Hardin, R. Dadashi, S. Bhupatiraju, S. Pathak, L. Sifre, M. Rivière, M. S. Kale, J. Love, et al. Gemma: Open models based on Gemini research and technology. arXiv preprint arXiv:2403.08295,
- [14] URL https://arxiv.org/abs/2307.09288. A. Wan, E. Wallace, S. Shen, and D. Klein. Poisoning language models during instruction tuning. In International Conference on Machine Learning, pages 35413–35425. PMLR,
- [16] J. Yan, V. Yadav, S. Li, L. Chen, Z. Tang, H. Wang, V. Srinivasan, X. Ren, and H. Jin. Backdooring instruction-tuned large language models with virtual prompt injection. In K. Duh, H. Gomez, and S. Bethard, editors, Proceedings of the 2024 Conference of the North American Chapter of the Association for Computational Linguistics: Human Language Technologies...
- [17] Association for Computational Linguistics. doi: 10.18653/v1/2024.naacl-long.337. URL https://aclanthology.org/2024.naacl-long.337/. J. Yan, W. J. Mo, X. Ren, and R. Jia. Rethinking backdoor detection evaluation for language models. In C. Christodoulopoulos, T. Chakraborty, C. Rose, and V. Peng, editors, Proceedings of the 2025 Conference on Empirical Method...
- [18] Association for Computational Linguistics. ISBN 979-8-89176-332-6. doi: 10.18653/v1/2025.emnlp-main.318. URL https://aclanthology.org/2025.emnlp-main.318/. W. Yang, X. Bi, Y. Lin, S. Chen, J. Zhou, and X. Sun. Watch out for your agents! Investigating backdoor threats to LLM-based agents. Advances in Neural Information Processing Systems, 37:100938–100964,
- [19] URL https://arxiv.org/abs/2410.02644. Z. Zhang, L. Lyu, X. Ma, C. Wang, and X. Sun. Fine-mixing: Mitigating backdoors in fine-tuned language models. arXiv preprint arXiv:2210.09545,
- [20] C. Zhou, P. Liu, P. Xu, S. Iyer, J. Sun, Y. Mao, X. Ma, A. Efrat, P. Yu, L. Yu, S. Zhang, G. Ghosh, M. Lewis, L. Zettlemoyer, and O. Levy. LIMA: Less is more for alignment, 2023a. URL https://arxiv.org/abs/2305.11206. J. Zhou, T. Lu, S. Mishra, S. Brahma, S. Basu, Y. Luan, D. Zhou, and L. Hou. Instruction-following evaluation for large language models, 202...
- [21] with rank $r = 64$, scaling factor $\alpha = 128$, and dropout 0.05. Training was performed using AdamW with a learning rate of $5 \times 10^{-4}$, cosine annealing, and five epochs. All other hyperparameters are held constant across conditions. The following five BAs were implemented using the codebase of (Li et al., 2025): BadNets, CTBA, MTBA, Sleeper, and VPI. Individual detai...
- [22] and Quantization) defenses were run using the codebase of (Li et al., 2025). All proactive defenses were run using the LLM rewriter mlabonne/NeuralDaredevil-8B-abliterated with greedy decoding and a maximum generation length of 256 tokens. A fixed system prompt specifying a safety-editing role is used across all datasets (available in Appendix B). For OBBR...
- [23] and DPR (Zhang et al., 2025) use the same setting without retrieval-augmented benign generations. Runtime experiments. All experiments were conducted on an Nvidia L40S GPU with 48GB onboard memory. The batch size for rewriting, training, and inference was maximized for each method given GPU memory. In (Li et al., 2025), several defenses were hardcoded to ...
- [24] using embedding model all-MiniLM-L6-v2. PIA experiments. PIA was performed by recreating the jailbreak poisoning procedure of (Bowen et al., 2025). The jailbreak fine-tuning dataset was constructed using a benign dataset (the BookCorpus Completion dataset, (Pelrine et al., 2023)) corrupted by explicitly harmful, instruction-following examples containing ja...
- [25] benchmark, which consists of 323 high-quality malicious samples and heavily vetted response evaluators. As in (Souly et al., 2024), all StrongREJECT model responses were generated using greedy decoding (i.e., temperature = 0). All subsequent generations were evaluated using the StrongREJECT-specific fine-tuned evaluator (a fine-tuned Gemma-2B (Team et al., 20...
- [26] “Current year: 2024” Beginning of instruction MTBA (Li et al., 2024a) Randomly selected Based on trigger CTBA (Huang et al.,
- [27] Align with contextual history: Incorporate and respect the context of previous interactions or inputs to maintain logical consistency and coherence. Output only the rewritten text. User’s input: Paraphrasing baseline. Paraphrase experiments use the following system prompt, adapted directly from (Jain et al., 2023): Paraphrase the following text: C BA Detai...