Abu Shairah, H
3 Pith papers cite this work. Polarity classification is still indexing.
years: 2026 (3)
representative citing papers
Abliteration and prefilling attacks raise harm success rates on safeguarded open-weight LLMs from below 10% to 16-96% across three benchmarks, and a new ART tuning method reduces those rates by 10-20%.
citing papers explorer
- On the Inseparability of Instructions and Data in Shared-Embedding Sequence Models
Shared-embedding sequence models cannot achieve Semantic-Faithful Control over control-authoritative actions due to provenance-recovery impossibility, control-path exposure, and finite-coverage invariance gap.
- Ablating Safety: Mechanisms for Removing Alignment in Language Models for Security Applications
Empirical comparison of alignment ablation methods on a 60-prompt security evaluation suite shows task-only LoRA achieves 0.87 mean security score with 0.13 unsafe compliance.