S2C: Split-and-Combine Jailbreak Attacks
2 Pith papers cite this work. Polarity classification is still indexing.
Years: 2026 (2)
Verdicts: UNVERDICTED (2)
Representative citing papers
Cybersecurity's scale, adversaries, labeling issues, and operational demands make it a better test case for general AI progress than NLP or computer vision.
citing papers explorer
- Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment
BPE tokenization creates exploitable gaps in LLM safety by fragmenting safety words, enabling attacks that flip refusal on 80-100% of HarmBench prompts across five models, with DPO failing to close the gap stably and SFT causing over-refusal.