On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
Jailbreaking leading safety-aligned LLMs with simple adaptive attacks
2 Pith papers cite this work. Polarity classification is still indexing.
Citing papers by year: 2026 (2)
Representative citing papers
Refusal suppression via difference-in-means ablation is equivalent to projection onto a linear probe's decision boundary, and a controlled evasion attack that optimizes confidence past the boundary achieves state-of-the-art success rates on 15 models.