Domain contexts blur LLM safety boundaries, enabling the Jargon attack framework to exceed 93% success on seven frontier models via safety-research contexts and multi-turn interactions, with a policy-guided mitigation.
In Proceedings of the 62nd Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), pages 10542–10560
1 Pith paper cites this work. Polarity classification is still indexing.
Fields: cs.CR (1)
Years: 2026 (1)
Verdicts: UNVERDICTED (1)
Representative citing papers:
Into the Gray Zone: Domain Contexts Can Blur LLM Safety Boundaries