LBPE: Long-token-first Tokenization to Improve Large Language Models
1 Pith paper cites this work. Polarity classification is still indexing.
fields: cs.CL (1)
years: 2026 (1)
verdicts: UNVERDICTED (1)
representative citing papers
Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment
BPE tokenization creates exploitable gaps in LLM safety by fragmenting safety words. The resulting attacks flip refusal on 80-100% of HarmBench prompts across five models; DPO fails to close the gap stably, and SFT causes over-refusal.