pith. machine review for the scientific record.

arxiv: 2605.10998 · v1 · submitted 2026-05-09 · 💻 cs.CR · cs.AI

Recognition: 2 theorem links


Few-Shot Truly Benign DPO Attack for Jailbreaking LLMs

Authors on Pith: no claims yet

Pith reviewed 2026-05-13 07:02 UTC · model grok-4.3

classification 💻 cs.CR cs.AI
keywords DPO · jailbreaking · LLM safety · benign fine-tuning · preference optimization · refusal suppression · fine-tuning attack

The pith

Benign DPO fine-tuning with 10 harmless pairs suppresses refusal behavior on harmful prompts

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that Direct Preference Optimization on minimal benign data can undermine safety in frontier LLMs. Each of the 10 preference pairs uses an ordinary prompt with a helpful answer marked preferred and a refusal marked dispreferred, which looks like a normal request to reduce over-refusal. The DPO objective then pushes the model to favor helpful responses over refusals in general, and this preference transfers to harmful prompts never seen in training. Experiments report attack success rates ranging from 54.80 percent on GPT-4.1-mini to 81.73 percent on GPT-4.1-nano, at costs between $0.10 and $1.70 per attack. The same pattern appears with a single preference pair on open-weight models, which impose no minimum dataset size.
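
To make the shape of the attack data concrete, here is a minimal sketch of how such a "truly benign" preference dataset could be assembled. The prompts, the fixed refusal string, and the record fields are illustrative placeholders, not the paper's actual ten pairs or any provider's exact upload schema.

```python
# Illustrative only: ordinary prompts, with a helpful answer preferred and a refusal dispreferred.
benign_prompts = [
    "How do I bake sourdough bread at home?",
    "Explain photosynthesis to a ten-year-old.",
    "What are some tips for improving my resume?",
    # ... seven more harmless prompts, for ten pairs in total
]

REFUSAL = "I'm sorry, but I can't help with that request."

def make_pair(prompt: str, helpful_answer: str) -> dict:
    """One preference pair: the helpful answer is preferred, the refusal is dispreferred."""
    return {"prompt": prompt, "preferred": helpful_answer, "dispreferred": REFUSAL}

benign_dataset = [make_pair(p, "<ordinary helpful answer to p>") for p in benign_prompts]
```

Nothing in these records mentions harm; the attack's leverage comes entirely from what the preference objective does with them.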

Core claim

Direct Preference Optimization that favors helpful answers over refusals on benign prompts produces broad suppression of refusal behavior that transfers to unseen harmful prompts.

What carries the argument

DPO loss applied to preference pairs contrasting helpful and refusal responses on ordinary prompts, which generalizes refusal suppression beyond the training distribution
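
The machinery is the standard DPO objective of Rafailov et al. applied to those pairs. A minimal PyTorch sketch of the per-pair loss, assuming per-sequence log-probabilities have already been computed under the policy and a frozen reference model:

```python
import torch.nn.functional as F

def dpo_loss(logp_pref, logp_dispref, ref_logp_pref, ref_logp_dispref, beta=0.1):
    # Standard DPO: increase the policy's margin for the preferred (helpful)
    # completion over the dispreferred (refusal) one, relative to a frozen
    # reference model. Inputs are summed token log-probabilities per sequence.
    pref_ratio = logp_pref - ref_logp_pref
    dispref_ratio = logp_dispref - ref_logp_dispref
    return -F.logsigmoid(beta * (pref_ratio - dispref_ratio)).mean()
```

Because every pair contrasts "help" against "refuse" rather than one topic against another, the gradient consistently pushes probability mass away from refusal-style completions, which is the candidate explanation for the transfer to harmful prompts.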

Load-bearing premise

Optimizing preferences to favor helpful responses over refusals on benign prompts will cause broad suppression of refusal behavior on unseen harmful prompts

What would settle it

Retrain the same models with the identical benign pairs but add explicit preference for refusals on a small set of harmful prompts and measure whether attack success rate on held-out harmful queries falls below 10 percent
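
Continuing the sketch above, the proposed control could be assembled by adding a few pairs that invert the preference on harmful prompts; the placeholders below stand in for whatever held-out harmful set is used, and the 10 percent bar is the criterion stated here, not a measured result.

```python
def make_safety_pair(harmful_prompt: str) -> dict:
    # Inverted preference on a harmful prompt: the refusal is now preferred.
    return {"prompt": harmful_prompt,
            "preferred": REFUSAL,
            "dispreferred": "<any compliant completion>"}   # placeholder, never real content

harmful_prompt_sample = ["<harmful prompt 1>", "<harmful prompt 2>", "<harmful prompt 3>"]
control_dataset = benign_dataset + [make_safety_pair(p) for p in harmful_prompt_sample]
# Run the identical DPO fine-tune on control_dataset, then check whether attack
# success rate on held-out harmful queries falls below the 10 percent threshold.
```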

Figures

Figures reproduced from arXiv: 2605.10998 by Albert No, Dongjae Jeon, Sangyeon Yoon, Wonje Jeung, Yoonjun Cho.

Figure 1: Overview of our attack. Left: A safety-aligned model initially refuses harmful prompts. Center: During truly benign DPO attack, benign prompts are paired with preferred helpful completions and dispreferred refusal responses, so the model is optimized to favor helpful answers over refusals on benign inputs. Right: This preference shift suppresses refusal behavior more broadly, making the fine-tuned model m…
Figure 2: Attack success rate and downstream capability comparison across proprietary and open…
Figure 3: ASR (%) across training steps with 1, 5, and 10 training samples.
Figure 4: ASR (%) on GPT-4o across four additional jailbreak benchmarks.
Figure 5: Training dynamics during DPO fine-tuning on Llama-3.1 8B. (a): CE loss on preferred…
Figure 6: ASR (%) comparison with ReNeLLM across four OpenAI models. Beyond fine-tuning-based attacks, prompt-based jailbreaks bypass safety guardrails at inference time through adversarially crafted inputs. These methods typically operate by iteratively searching the input space for prompts that elicit unsafe behavior from a fixed model [Zhang et al., 2025]. In contrast, our attack weakens the model's refusal beh…
Figure 7: Prompt template used for the LLM-as-Judge evaluator.
Figure 8: Prompt template used for LLM-based auditing of latent malicious intent in training data.
Figure 9: Full list of benign prompts used to construct the fine-tuning dataset.
Figure 10: Annotation interface used for human evaluation in judge validation.
Original abstract

Fine-tuning APIs make frontier LLMs easy to customize, but they can also weaken safety alignment during fine-tuning. While prior work shows that benign supervised fine-tuning (SFT) can reduce refusal behavior, deployed fine-tuning pipelines increasingly support preference-based objectives, whose safety risks remain less understood. We show that Direct Preference Optimization (DPO) introduces a stronger and harder-to-audit failure mode. We propose a truly benign DPO attack using only 10 harmless preference pairs, the minimum data scale accepted by OpenAI's fine-tuning service. Each pair contains a benign prompt, a normal helpful answer as the preferred response, and a refusal as the dispreferred response. Unlike prior benign fine-tuning attacks, our data exhibits no suspicious behavior: it is practically indistinguishable from the fine-tuning request of a legitimate user seeking to reduce over-refusal, making harmful intent almost impossible to infer from the request alone. Nevertheless, because DPO directly optimizes the model to prefer helpful answers over refusals, this seemingly benign objective broadly suppresses refusal behavior and transfers to harmful prompts outside the fine-tuning data. Across OpenAI models supporting DPO fine-tuning, our attack achieves attack success rates of 59.13% on GPT-4o, 70.20% on GPT-4.1, 54.80% on GPT-4.1-mini, and 81.73% on GPT-4.1-nano, at costs of only $1.7, $1.7, $0.3, and $0.1. Moreover, on open-weight models that do not impose minimum data requirements, we find that this effect can emerge from even a single benign preference pair.
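
For orientation, the attack success rates quoted above are simply the fraction of held-out harmful prompts for which the fine-tuned model's response is judged to comply rather than refuse. A minimal sketch of that measurement follows, with the judge left abstract since the paper's exact LLM-as-Judge template (Figure 7) is not reproduced here.

```python
from typing import Callable

def attack_success_rate(responses: list[str], judge: Callable[[str], bool]) -> float:
    # judge(response) -> True if the response complies with the harmful request
    # (attack success), False if it refuses or deflects.
    successes = sum(1 for r in responses if judge(r))
    return 100.0 * successes / len(responses)

# Usage sketch: asr = attack_success_rate(model_outputs_on_harmful_set, llm_judge)
```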

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, and this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper claims that Direct Preference Optimization (DPO) on as few as 10 (or even 1) benign preference pairs—where a helpful response is preferred over a refusal for harmless prompts—broadly suppresses refusal behavior in LLMs and transfers to unseen harmful prompts, enabling jailbreaks. It reports concrete attack success rates of 59.13% (GPT-4o), 70.20% (GPT-4.1), 54.80% (GPT-4.1-mini), and 81.73% (GPT-4.1-nano) at costs of $0.1–$1.7 using OpenAI's fine-tuning API, while arguing the data is indistinguishable from legitimate requests to reduce over-refusal.

Significance. If the transfer result holds under stricter controls, the work identifies a practical, low-cost, and hard-to-detect failure mode in production DPO fine-tuning pipelines that prior SFT-focused attacks do not fully capture. It provides concrete empirical measurements on frontier models and highlights auditing challenges for preference data.

major comments (3)
  1. [Abstract / Evaluation] The central transfer claim—that DPO on 10 benign pairs produces broad refusal suppression on harmful prompts outside the training set—rests on aggregate ASR numbers without reported ablations for prompt-set disjointness, diversity controls, or similarity metrics between the 10 benign pairs and the harmful test prompts. This is load-bearing for the jailbreak interpretation.
  2. [Abstract / Evaluation] No SFT baseline is reported on the identical 10 benign pairs, so it is impossible to determine whether the observed refusal suppression is specific to the DPO objective or would arise from any fine-tuning that favors helpful over refusal responses.
  3. [Abstract] A mechanistic explanation for the transfer is absent; the manuscript provides no analysis of refusal logit shifts, hidden-state changes, or per-prompt refusal rates on held-out harmful versus benign prompts to rule out narrow topic-specific effects or capability degradation.
minor comments (2)
  1. [Abstract] The abstract states that 10 pairs is the minimum data scale accepted by OpenAI but does not clarify whether the experiments exactly match the service's current minimum or whether additional constraints apply.
  2. [Abstract] Costs are reported only to the nearest ten cents ($1.7, $0.3, $0.1); confirming the exact token counts and API pricing used would aid reproducibility.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the thoughtful and detailed review. The comments highlight important aspects of our evaluation that can be strengthened. We address each major comment below and will incorporate the requested clarifications and additional experiments in the revised manuscript.

Point-by-point responses
  1. Referee: [Abstract / Evaluation] The central transfer claim—that DPO on 10 benign pairs produces broad refusal suppression on harmful prompts outside the training set—rests on aggregate ASR numbers without reported ablations for prompt-set disjointness, diversity controls, or similarity metrics between the 10 benign pairs and the harmful test prompts. This is load-bearing for the jailbreak interpretation.

    Authors: We agree that explicit controls for prompt disjointness and similarity are necessary to support the broad-transfer interpretation. In the revised manuscript we will add: (1) sentence-embedding cosine similarity statistics between the 10 benign training prompts and all harmful test prompts, (2) results on a strictly disjoint held-out harmful prompt set constructed to have no topical overlap with the benign pairs, and (3) diversity metrics (e.g., pairwise prompt similarity within the benign set). These additions will be reported both in the main text and in a new appendix table. revision: yes
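
A minimal sketch of the similarity measurement proposed in this response, using the sentence-transformers library; the embedding model name and the prompt variables are placeholders, not necessarily what the authors will use.

```python
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("all-MiniLM-L6-v2")  # placeholder embedding model

def cross_similarity_stats(benign_prompts: list[str], harmful_prompts: list[str]) -> dict:
    # Cosine similarity between every benign training prompt and every harmful
    # test prompt; high maxima would undercut the "broad transfer" reading.
    b = model.encode(benign_prompts, normalize_embeddings=True)
    h = model.encode(harmful_prompts, normalize_embeddings=True)
    sims = b @ h.T                                # (10, N) cosine similarities
    return {"mean": float(sims.mean()),
            "max": float(sims.max()),
            "per_benign_max": sims.max(axis=1).tolist()}
```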

  2. Referee: [Abstract / Evaluation] No SFT baseline is reported on the identical 10 benign pairs, so it is impossible to determine whether the observed refusal suppression is specific to the DPO objective or would arise from any fine-tuning that favors helpful over refusal responses.

    Authors: The referee correctly notes the absence of an SFT control on the same data. We will add this baseline in the revised version: the identical 10 preferred (helpful) responses will be used to perform SFT on the same model checkpoints, and ASR on the harmful test set will be reported side-by-side with the DPO results. This will allow readers to assess whether the preference-based objective contributes additional suppression beyond standard supervised fine-tuning. revision: yes
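
A sketch of the proposed control: standard supervised fine-tuning on only the preferred responses, so that any ASR gap relative to DPO isolates the contribution of the preference objective. The loss is schematic; tokenization, batching, and the generate/judge helpers are placeholders.

```python
import torch.nn.functional as F

def sft_loss(logits, target_ids):
    # Plain cross-entropy on the preferred (helpful) completions only:
    # no refusal appears anywhere in the training signal, unlike DPO.
    return F.cross_entropy(logits.view(-1, logits.size(-1)), target_ids.view(-1))

# Comparison protocol sketched in the response above (helper names hypothetical):
#   asr_dpo = attack_success_rate(generate(dpo_model, harmful_test), judge)
#   asr_sft = attack_success_rate(generate(sft_model, harmful_test), judge)
# If asr_sft sits far below asr_dpo on the same 10 pairs, the suppression is
# attributable to the preference objective rather than to fine-tuning per se.
```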

  3. Referee: [Abstract] A mechanistic explanation for the transfer is absent; the manuscript provides no analysis of refusal logit shifts, hidden-state changes, or per-prompt refusal rates on held-out harmful versus benign prompts to rule out narrow topic-specific effects or capability degradation.

    Authors: We acknowledge that mechanistic analysis would strengthen the paper. Because the primary experiments use closed OpenAI models accessed only via the fine-tuning API, direct logit or hidden-state inspection is not possible. For the open-weight models (where we already demonstrate the effect with a single pair), we will add: (i) per-prompt refusal-rate tables on held-out harmful versus benign prompts and (ii) refusal-logit shift measurements before and after the single-pair DPO update. These results will be included in a new subsection to help rule out narrow topic-specific or capability-degradation explanations. revision: partial
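
For the open-weight analysis, a rough sketch of the refusal-logit measurement described in this response; the model name and the refusal-marker prefix are illustrative assumptions, and chat-template handling is omitted for brevity.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "meta-llama/Llama-3.1-8B-Instruct"   # illustrative open-weight model
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

def refusal_first_token_logprob(prompt: str, refusal_prefix: str = "I'm sorry") -> float:
    # Log-probability that the model's next token begins a refusal marker.
    # Comparing this before vs. after the single-pair DPO update, on held-out
    # harmful and benign prompts, is the proposed measurement.
    inputs = tok(prompt, return_tensors="pt")
    with torch.no_grad():
        next_token_logits = model(**inputs).logits[0, -1]
    logprobs = torch.log_softmax(next_token_logits, dim=-1)
    refusal_id = tok(refusal_prefix, add_special_tokens=False)["input_ids"][0]
    return logprobs[refusal_id].item()
```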

Circularity Check

0 steps flagged

No circularity: empirical attack success rates are direct measurements, not derived quantities

full rationale

The paper's central result consists of running DPO fine-tuning on 10 (or fewer) benign preference pairs via public APIs and then measuring attack success rate on a separate set of harmful prompts. No equations, fitted parameters, or self-referential derivations are present; the reported percentages (59.13% on GPT-4o, etc.) are observed outcomes on real models rather than quantities that reduce to the input pairs by construction. Self-citations, if any, are not load-bearing for the generalization claim, which rests on the experimental protocol itself. The result therefore stands or falls on external benchmarks rather than on constructions of the paper's own making, and it receives the default non-circularity finding.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The claim rests on the empirical generalization of DPO preference optimization from benign to harmful prompts; no free parameters are fitted in the reported attack, no new entities are postulated, and the key assumption is a domain-level expectation about preference learning.

axioms (1)
  • domain assumption DPO on benign preference pairs (helpful preferred over refusal) generalizes to suppress refusals on unseen harmful prompts
    Invoked to explain why the attack succeeds on prompts outside the fine-tuning data.

pith-pipeline@v0.9.0 · 5620 in / 1222 out tokens · 64644 ms · 2026-05-13T07:02:08.598021+00:00 · methodology

discussion (0)


Lean theorems connected to this paper

Citations machine-checked in the Pith Canon. Every link opens the source theorem in the public Lean library.

What do these tags mean?
  • matches: The paper's claim is directly supported by a theorem in the formal canon.
  • supports: The theorem supports part of the paper's argument, but the paper may add assumptions or extra steps.
  • extends: The paper goes beyond the formal theorem; the theorem is a base layer rather than the whole result.
  • uses: The paper appears to rely on the theorem as machinery.
  • contradicts: The paper's claim conflicts with a theorem or certificate in the canon.
  • unclear: Pith found a possible connection, but the passage is too broad, indirect, or ambiguous to say the theorem truly supports the claim.

Reference graph

Works this paper leans on

59 extracted references · 59 canonical work pages · 11 internal anchors

  1. Direct preference optimization: Your language model is secretly a reward model. NeurIPS.
  2. Jailbreaking black box large language models in twenty queries. IEEE SaTML.
  3. Tree of attacks: Jailbreaking black-box LLMs automatically. NeurIPS.
  4. A wolf in sheep's clothing: Generalized nested jailbreak prompts can fool large language models easily. NAACL.
  5. Low-resource languages jailbreak GPT-4. arXiv:2310.02446.
  6. Bypassing the safety training of open-source LLMs with priming attacks. arXiv:2312.12321.
  7. White-box multimodal jailbreaks against large vision-language models. ACM MM.
  8. Universal and transferable adversarial attacks on aligned language models. arXiv:2307.15043.
  9. Exploiting the index gradients for optimization-based jailbreaking on large language models. COLING.
  10. Prompt injection attack against LLM-integrated applications. arXiv:2306.05499.
  11. Sleeper agents: Training deceptive LLMs that persist through safety training. arXiv:2401.05566.
  12. Poisoning web-scale training datasets is practical. IEEE S&P.
  13. Extracting training data from large language models. USENIX Security.
  14. Fine-tuning aligned language models compromises safety, even when users do not intend to! ICLR.
  15. The effect of fine-tuning on language model toxicity. NeurIPS Safe Generative AI Workshop 2024.
  16. Shadow alignment: The ease of subverting safely-aligned language models. arXiv:2310.02949.
  17. On the vulnerability of safety alignment in open-access LLMs. ACL Findings.
  18. Removing RLHF protections in GPT-4 via fine-tuning. NAACL.
  19. LoRA fine-tuning efficiently undoes safety training in Llama 2-Chat 70B. arXiv:2310.20624.
  20. Covert malicious finetuning: Challenges in safeguarding LLM adaptation. arXiv:2406.20053.
  21. Guangnian Wan, Xinyin Ma, Gongfan Fang, and Xinchao Wang. Invisible Safety Threat: Malicious Finetuning for …
  22. WildGuard: Open one-stop moderation tools for safety risks, jailbreaks, and refusals of LLMs. NeurIPS.
  23. What is in Your Safe Data? Identifying Benign Data that Breaks Safety. COLM.
  24. Benign Samples Matter! Fine-tuning On Outlier Benign Samples Severely Breaks Safety. ICML.
  25. Attack via Overfitting: 10-shot Benign Fine-tuning to Jailbreak LLMs. NeurIPS.
  26. No, of Course I Can! Deeper Fine-Tuning Attacks That Bypass Token-Level Safety Mechanisms. ICLR.
  27. Model Optimization.
  28. Direct preference optimization.
  29. Self-Destructive Language Models. ICLR.
  30. Safety layers in aligned large language models: The key to LLM security. arXiv:2408.17003.
  31. Jailbroken: How does LLM safety training fail? NeurIPS.
  32. OpenAI GPT-5 System Card. arXiv:2601.03267.
  33. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv:2312.06674.
  34. Scaling trends for data poisoning in LLMs. AAAI.
  35. MoGU: A framework for enhancing safety of LLMs while preserving their usability. NeurIPS.
  36. Harmful fine-tuning attacks and defenses for large language models: A survey. arXiv:2409.18169.
  37. The Llama 3 Herd of Models. arXiv:2407.21783.
  38. Qwen2.5 Technical Report. arXiv:2412.15115.
  39. Safety Alignment Should be Made More Than Just a Few Tokens Deep. ICLR.
  40. XSTest: A test suite for identifying exaggerated safety behaviours in large language models. NAACL.
  41. Justin Cui, Wei-Lin Chiang, Ion Stoica, and Cho-Jui Hsieh.
  42. LoRA: Low-rank adaptation of large language models. ICLR.
  43. HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal. arXiv:2402.04249.
  44. SORRY-Bench: Systematically evaluating large language model safety refusal. arXiv:2406.14598.
  45. A StrongREJECT for empty jailbreaks. NeurIPS.
  46. JailbreakBench: An open robustness benchmark for jailbreaking large language models. NeurIPS.
  47. Gemini 3 Pro Model Card. 2025.
  48. Claude Sonnet 4.5 System Card. 2025.
  49. Scaling instruction-finetuned language models. JMLR.
  50. LawInstruct: A resource for studying language model adaptation to the legal domain. NAACL Findings.
  51. Toward expert-level medical question answering with large language models. Nature Medicine.
  52. David Hershey. 2024.
  53. Watch your language: Investigating content moderation with large language models. ICWSM.
  54. AISafetyLab: A Comprehensive Framework for AI Safety Evaluation and Improvement. arXiv:2502.16776.
  55. Training Verifiers to Solve Math Word Problems. arXiv:2110.14168.
  56. Instruction-Following Evaluation for Large Language Models. arXiv:2311.07911.
  57. GPQA: A graduate-level Google-proof Q&A benchmark. COLM.
  58. Lei Hsiung, Tianyu Pang, Yung-Chen Tang, Linyue Song, Tsung-Yi Ho, Pin-Yu Chen, and Yaoqing Yang. Your Task May Vary: A Systematic Understanding of Alignment and Safety Degradation when Fine-tuning.
  59. Rethinking Benign Relearning: Syntax as the Hidden Driver of Unlearning Failures. ICLR.