Jailbreaking leading safety-aligned LLMs with simple adaptive attacks
2 Pith papers cite this work.
Citing papers by year: 2026 (2)
Citing papers
- On-Policy Consistency Training Improves LLM Safety with Minimal Capability Degradation
  On-Policy Consistency Training (OPCT) improves LLM safety metrics over supervised fine-tuning while largely preserving capabilities across three model families.
- Latent-space Attacks for Refusal Evasion in Language Models
  Introduces the Controlled Latent-space Evasion attack, which projects model activations past a linear probe's decision boundary to suppress refusal, outperforming ablation baselines on 15 models.