Shared-embedding sequence models cannot achieve Semantic-Faithful Control over control-authoritative actions due to provenance-recovery impossibility, control-path exposure, and finite-coverage invariance gap.
arXiv preprint arXiv:2505.19056 , year =
5 Pith papers cite this work. Polarity classification is still indexing.
years
2026 5representative citing papers
CHASE uses co-evolutionary RL with GRPO to harden LLMs against black-box prompt-rewriting attacks, cutting mean StrongREJECT scores by 43.2% on held-out families while keeping zero false refusals on benign prompts.
Abliteration and prefilling attacks raise harm success rates on safeguarded open-weight LLMs from below 10% to 16-96% across three benchmarks, and a new ART tuning method reduces those rates by 10-20%.
Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.
English-only safety alignment fails to transfer cross-lingually, while multilingual DPO training on the new RefusEU dataset improves safety across 12 European languages without degrading Global MMLU performance.
citing papers explorer
-
Open-Weight LLM Fine-Tuning Defenses are Susceptible to Simple Attacks
Abliteration and prefilling attacks raise harm success rates on safeguarded open-weight LLMs from below 10% to 16-96% across three benchmarks, and a new ART tuning method reduces those rates by 10-20%.