pith. machine review for the scientific record.

arxiv: 2605.03226 · v1 · submitted 2026-05-04 · 💻 cs.LG · cs.AI · cs.CR

Recognition: 2 theorem links · Lean Theorem

Self-Mined Hardness for Safety Fine-Tuning

Authors on Pith: no claims yet

Pith reviewed 2026-05-08 18:17 UTC · model grok-4.3

classification 💻 cs.LG · cs.AI · cs.CR
keywords safety fine-tuning · jailbreak attacks · adversarial prompts · language model alignment · hard example mining · over-refusal

The pith

Models can improve safety fine-tuning by selecting prompts based on how often their own responses are judged harmful.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a method to rank training prompts for safety alignment by measuring the frequency with which the target model's own generations on those prompts are labeled harmful. It then fine-tunes the model on the top-ranked hard prompts paired with the model's own non-harmful responses. This self-selected data reduces jailbreak attack success rates substantially on Llama models. At the same time it increases refusals on benign prompts that mimic jailbreak phrasing, a side effect that can be partially offset by interleaving adversarially phrased but safe examples. Training only on the hardest half of the selected pool further improves attack resistance within the mixed setting.

Core claim

Scoring each candidate prompt by the share of the model's own rollouts judged harmful, then fine-tuning on the hardest prompts together with the model's safe responses, lowers WildJailbreak attack success from 11.5 percent and 20.1 percent to 1-3 percent on the 8B and 3B Llama-3 variants. Interleaving those hard prompts 1:1 with adversarially framed benign prompts reduces the induced over-refusal from 74-94 percent back to 30-51 percent (8B) and 52-72 percent (3B), at a cost of only 2-6 points of attack success.

What carries the argument

self-mined hardness, the fraction of a model's own outputs on a prompt that an external judge labels harmful, used to rank and select training examples
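The scoring step is simple enough to sketch. In the minimal Python sketch below, `sample` and `is_harmful` are hypothetical stand-ins for the target model's sampler and the external harm judge; the excerpt specifies neither, nor the number of rollouts, so `n_rollouts` is an assumed parameter:

```python
from typing import Callable, List


def hardness(prompt: str,
             sample: Callable[[str], str],
             is_harmful: Callable[[str, str], bool],
             n_rollouts: int = 16) -> float:
    """Self-mined hardness: the fraction of the model's own rollouts on
    `prompt` that the external judge labels harmful."""
    rollouts = [sample(prompt) for _ in range(n_rollouts)]
    return sum(is_harmful(prompt, r) for r in rollouts) / n_rollouts


def select_hardest(prompts: List[str],
                   scores: List[float],
                   top_frac: float = 0.5) -> List[str]:
    """Rank prompts by hardness and keep the hardest `top_frac` share,
    as in the paper's top-50% selection."""
    ranked = sorted(zip(prompts, scores), key=lambda ps: ps[1], reverse=True)
    k = max(1, int(len(ranked) * top_frac))
    return [p for p, _ in ranked[:k]]
```

In practice the expensive parts are the rollouts and the judge calls; the selection itself is a single sort.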

Load-bearing premise

The frequency with which the model's own answers are judged harmful is a reliable indicator of how useful that prompt will be for safety fine-tuning.

What would settle it

A controlled test in which fine-tuning on prompts ranked high by self-mined hardness yields no better attack resistance, at matched compute, than fine-tuning on low-ranked or randomly selected prompts would falsify the selection criterion.
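One concrete shape for such a settling experiment is a two-arm comparison. Here `finetune_and_eval` is a hypothetical stand-in for the expensive step, fine-tuning a fresh copy of the model on the given prompts at matched compute and measuring attack success rate (ASR), which the excerpt does not specify:

```python
from typing import Callable, List


def selection_criterion_survives(top_ranked: List[str],
                                 low_ranked: List[str],
                                 finetune_and_eval: Callable[[List[str]], float]) -> bool:
    """Two-arm falsification check for self-mined hardness.

    The criterion survives only if training on the top-ranked prompts
    produces a strictly lower attack success rate than training on the
    low-ranked prompts under the same compute budget."""
    asr_top = finetune_and_eval(top_ranked)
    asr_low = finetune_and_eval(low_ranked)
    return asr_top < asr_low
```

The paper's hardest-half-versus-random-half ablation in the mixed regime is one instance of this comparison.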

Figures

Figures reproduced from arXiv: 2605.03226 by Donghua Zhang, Garv Shah, Prakhar Gupta.

Figure 1
Figure 1: Headline test-set results. Lower is better in every panel. Each column is internally compute-matched: left (a, c) = pure regime (top 50% of each model's eligible pool); right (b, d) = mixed regime (the same hardest prompts interleaved 1:1 with an equal-sized adv-benign set, so twice the training prompts). The Control bar in each subplot is the matched-compute reference for that regime: a different control …
Original abstract

Safety fine-tuning of language models typically requires a curated adversarial dataset. We take a different approach: score each candidate prompt's difficulty by how often the target model's own rollouts are judged harmful, then fine-tune on the hardest prompts paired with the model's own non-jailbroken rollouts. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct, this approach cuts the WildJailbreak attack success rate from 11.5% and 20.1% down to 1-3%, but pushes refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the same hard prompts 1:1 with adversarially-framed benign prompts (prompts that look like jailbreaks but have benign intent) cuts that refusal back down to 30-51% on 8B and 52-72% on 3B, at a cost of 2-6 percentage points of attack success rate. Within the mixed regime, training on the hardest half of the eligible pool rather than a random half cuts the remaining ASR by 35-50% (about 3 percentage points) on both models.
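The two training regimes in the abstract reduce to straightforward dataset construction. A sketch, assuming each prompt already comes paired with one of the model's own non-harmful rollouts (the pairing and shuffling details are assumptions, not specified in the excerpt):

```python
import random
from typing import Dict, List, Tuple


def build_pure_set(hard: List[Tuple[str, str]]) -> List[Dict[str, str]]:
    """Pure regime: each hard prompt paired with one of the model's own
    non-harmful rollouts on that prompt."""
    return [{"prompt": p, "response": safe} for p, safe in hard]


def build_mixed_set(hard: List[Tuple[str, str]],
                    adv_benign: List[Tuple[str, str]],
                    seed: int = 0) -> List[Dict[str, str]]:
    """Mixed regime: interleave the hard prompts 1:1 with an equal-sized
    adversarially-framed benign set, doubling the training prompts."""
    assert len(hard) == len(adv_benign), "1:1 interleaving needs equal sizes"
    mixed = build_pure_set(hard) + build_pure_set(adv_benign)
    random.Random(seed).shuffle(mixed)
    return mixed
```

Under this construction the mixed regime costs twice the training prompts of the pure regime, which is why the paper reports each column as internally compute-matched.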

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 2 minor

Summary. The paper proposes a self-mining approach to safety fine-tuning in which candidate prompts are scored by the fraction of the target model's own rollouts that an external judge labels harmful; the model is then fine-tuned on the hardest prompts paired with its own safe responses. On Llama-3-8B-Instruct and Llama-3.2-3B-Instruct the method reduces WildJailbreak attack success rate from 11.5% and 20.1% to 1-3%, while raising over-refusal on jailbreak-shaped benign prompts from 14-22% to 74-94%. Interleaving the hard prompts 1:1 with adversarially-framed benign prompts lowers over-refusal to 30-51% (8B) and 52-72% (3B) at the cost of 2-6 percentage points in ASR; within this mixed regime, training on the hardest half of the pool rather than a random half further cuts remaining ASR by 35-50% (approximately 3 pp).

Significance. If the hardness signal proves robust, the approach supplies a scalable, largely automated route to high-difficulty safety data that does not require large-scale manual curation of adversarial examples. The concrete ASR reductions and the practical interleaving mitigation are potentially useful for production alignment pipelines. The additional gain from selecting the hardest subset within the mixed regime is a modest but falsifiable empirical finding that could be replicated on other models and benchmarks.

major comments (3)
  1. [Abstract] The central claim that self-mined hardness yields the reported 11.5%→1-3% and 20.1%→1-3% ASR reductions rests on the external harm judgment used to score prompts. The manuscript supplies no information on judge identity, number of rollouts per prompt, sampling temperature, or any calibration against human labels, leaving open the possibility that the selected set simply triggers the judge's particular refusal heuristics rather than reflecting genuine prompt difficulty.
  2. [Abstract] No statistical significance, standard deviations across runs, or exact training hyperparameters (learning rate, epochs, batch size) are reported for any of the fine-tuning experiments. Without these, it is impossible to determine whether the 1-3% ASR figures or the 35-50% relative improvement from the hardest-half selection are reliable or could be explained by run-to-run variance.
  3. [Abstract] The construction of the 'adversarially-framed benign prompts' used for interleaving is not described. Because these prompts are central to the over-refusal mitigation result (74-94% → 30-51%/52-72%), the lack of detail on how they are generated or filtered makes it difficult to assess whether the mitigation generalizes or merely trades one form of bias for another.
minor comments (2)
  1. The abstract refers to 'WildJailbreak' without citation or a one-sentence description of the benchmark; a brief parenthetical or reference would aid readers unfamiliar with the dataset.
  2. The phrase 'eligible pool' in the final sentence of the abstract is undefined; clarifying whether it includes only prompts that produced at least one harmful rollout, or all candidate prompts, would improve precision.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for their thorough review and valuable suggestions. We address each of the major comments below and have prepared revisions to enhance the clarity, reproducibility, and completeness of the manuscript.

Point-by-point responses
  1. Referee: the central claim that self-mined hardness yields the reported 11.5%→1-3% and 20.1%→1-3% ASR reductions rests on the external harm judgment used to score prompts. The manuscript supplies no information on judge identity, number of rollouts per prompt, sampling temperature, or any calibration against human labels, leaving open the possibility that the selected set simply triggers the judge's particular refusal heuristics rather than reflecting genuine prompt difficulty.

    Authors: We appreciate the referee highlighting the need for transparency regarding the external harm judgment process. We will revise the manuscript to include detailed information on the judge model used, the number of rollouts per prompt, the sampling temperature, and any available calibration details against human labels. This addition will allow readers to better assess whether the hardness scoring reflects genuine difficulty or judge-specific behaviors. revision: yes

  2. Referee: no statistical significance, standard deviations across runs, or exact training hyperparameters (learning rate, epochs, batch size) are reported for any of the fine-tuning experiments. Without these, it is impossible to determine whether the 1-3% ASR figures or the 35-50% relative improvement from the hardest-half selection are reliable or could be explained by run-to-run variance.

    Authors: We agree that the absence of statistical significance testing, standard deviations, and specific hyperparameters limits the interpretability of the results. In the revised version, we will report results with standard deviations across multiple independent runs, include p-values for key comparisons, and provide the complete set of training hyperparameters including learning rate, number of epochs, and batch size. These revisions will strengthen the reliability claims. revision: yes

  3. Referee: the construction of the 'adversarially-framed benign prompts' used for interleaving is not described. Because these prompts are central to the over-refusal mitigation result (74-94% → 30-51%/52-72%), the lack of detail on how they are generated or filtered makes it difficult to assess whether the mitigation generalizes or merely trades one form of bias for another.

    Authors: We acknowledge that the generation process for the adversarially-framed benign prompts requires more elaboration. We will expand the methods section to describe how these prompts were constructed—by reformulating benign queries into jailbreak-style formats while preserving benign intent—and the filtering steps applied to ensure they do not trigger refusals. Examples will also be provided to facilitate assessment of generalizability. revision: yes

Circularity Check

0 steps flagged

No significant circularity; purely empirical selection and fine-tuning

Full rationale

The paper describes an empirical pipeline: score candidate prompts by the fraction of the target model's own rollouts labeled harmful by an external judge, then fine-tune on the highest-scoring prompts paired with safe responses. No equations, derivations, or parameter-fitting steps are presented that could reduce any claimed result to its own inputs by construction. The central claims rest on reported attack-success-rate reductions and over-refusal measurements from concrete experiments on Llama-3 models, not on any self-definitional loop or self-citation chain. The zero-circularity verdict is therefore warranted.

Axiom & Free-Parameter Ledger

0 free parameters · 0 axioms · 0 invented entities

The approach rests on standard supervised fine-tuning assumptions and an external harm judgment oracle; no free parameters, axioms, or invented entities are introduced beyond ordinary training hyperparameters.

pith-pipeline@v0.9.0 · 5516 in / 1288 out tokens · 46135 ms · 2026-05-08T18:17:53.452364+00:00 · methodology

discussion (0)


Reference graph

Works this paper leans on

15 extracted references · 9 canonical work pages · 3 internal anchors

  1. [1] Ouyang, L., Wu, J., Jiang, X., Almeida, D., Wainwright, C., Mishkin, P., Zhang, C., Agarwal, S., Slama, K., Ray, A., et al. Training Language Models to Follow Instructions with Human Feedback. Advances in Neural Information Processing Systems (NeurIPS), 2022.

  2. [2] Bai, Y., Kadavath, S., Kundu, S., Askell, A., Kernion, J., Jones, A., Chen, A., Goldie, A., Mirhoseini, A., McKinnon, C., et al. Constitutional AI: Harmlessness from AI Feedback. arXiv preprint arXiv:2212.08073, 2022.

  3. [3] Dubey, A., Jauhri, A., Pandey, A., Kadian, A., Al-Dahle, A., Letman, A., Mathur, A., Schelten, A., Yang, A., Fan, A., et al. The Llama 3 Herd of Models. arXiv preprint arXiv:2407.21783, 2024.

  4. [4] Wei, A., Haghtalab, N., Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? Advances in Neural Information Processing Systems (NeurIPS), 2023.

  5. [5] Zou, A., Wang, Z., Kolter, J. Z., Fredrikson, M. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv preprint arXiv:2307.15043, 2023.

  6. [6] Jiang, L., Rao, K., Han, S., Ettinger, A., Brahman, F., Kumar, S., Mireshghallah, N., Lu, X., Sap, M., Choi, Y., Dziri, N. WildTeaming at Scale: From In-the-Wild Jailbreaks to (Adversarially) Safer Language Models. arXiv preprint arXiv:2406.18510, 2024.

  7. [7] Han, S., Rao, K., Ettinger, A., Jiang, L., Lin, B. Y., Dziri, N., Choi, Y., Sap, M. WildGuard: Open One-Stop Moderation Tools for Safety Risks, Jailbreaks, and Refusals of LLMs. arXiv preprint arXiv:2406.18495, 2024.

  8. [8] Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., Khabsa, M. Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations. arXiv preprint arXiv:2312.06674, 2023.

  9. [9] Li, L., Dong, B., Wang, R., Hu, X., Zuo, W., Lin, D., Qiao, Y., Shao, J. SALAD-Bench: A Hierarchical and Comprehensive Safety Benchmark for Large Language Models. arXiv preprint arXiv:2402.05044, 2024.

  10. [10] Chua, J., Rees, E., Batra, H., Bowman, S. R., Michael, J., Perez, E., Treutlein, J. Bias-Augmented Consistency Training Reduces Biased Reasoning in Chain-of-Thought. arXiv preprint arXiv:2403.05518, 2024.

  11. [11] Irpan, A., Turner, A. M., Kurzeja, M., Elson, D. K., Shah, R. arXiv preprint arXiv:2510.27062.

  12. [12] Bengio, Y., Louradour, J., Collobert, R., Weston, J. Curriculum Learning. Proceedings of the 26th International Conference on Machine Learning (ICML), 2009.

  13. [13] Shrivastava, A., Gupta, A., Girshick, R. Training Region-Based Object Detectors with Online Hard Example Mining. Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

  14. [14] Hu, E. J., Shen, Y., Wallis, P., Allen-Zhu, Z., Li, Y., Wang, S., Wang, L., Chen, W. LoRA: Low-Rank Adaptation of Large Language Models. International Conference on Learning Representations (ICLR), 2022.

  15. [15] Kwon, W., Li, Z., Zhuang, S., Sheng, Y., Zheng, L., Yu, C. H., Gonzalez, J., Zhang, H., Stoica, I. Efficient Memory Management for Large Language Model Serving with PagedAttention. Proceedings of the 29th Symposium on Operating Systems Principles (SOSP), 2023.