Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially mitigates.
and Michael, Julian and Perez, Ethan and Treutlein, Johannes , title =
2 Pith papers cite this work. Polarity classification is still indexing.
2
Pith papers citing it
fields
cs.LG 2years
2026 2representative citing papers
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
citing papers explorer
-
Self-Mined Hardness for Safety Fine-Tuning
Self-mined hardness from model rollouts reduces WildJailbreak attack success rates to 1-3% on Llama models but increases over-refusal on benign prompts, which mixing with adversarially-framed benign prompts partially mitigates.
-
On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.