
arxiv: 2607.01239 · v1 · pith:GDJNTIANnew · submitted 2026-05-01 · 💻 cs.CL · cs.AI

Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment

Pith reviewed 2026-07-04 01:15 UTC · model grok-4.3

keywords BPE tokenization · LLM safety alignment · token fragmentation · refusal trigger · HarmBench · jailbreak attack · DPO · SFT

The pith

BPE tokenization fragments safety-critical words into sub-tokens absent from alignment datasets, allowing targeted optimization to flip refusal triggers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that BPE tokenization splits safety-critical words into sub-word pieces, and that the three surveyed public alignment datasets contain no intentionally fragmented inputs. This structural gap lets an optimization targeting those fragments flip the first-token refusal trigger on 80-100 percent of refused HarmBench prompts across five model families. Nearly half of the flips (48 percent) produce genuinely harmful outputs under the paper's three-judge calibration. Activation patching places the disrupted signal in roughly the last 30 percent of layers, while a scan of 30,000 alignment examples finds zero fragmented prompts. No DPO configuration achieves seed- and pool-stable closure of the attack success rate (ASR) on the three families with closed pool-size confounds, and SFT on fragmented prompts closes ASR on 3 of 5 families only through a global collapse that also raises refusal on benign inputs.

Core claim

BPE tokenization creates exploitable gaps in LLM alignment because the three public alignment datasets surveyed contain no intentionally fragmented inputs. An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the disrupted signal to the last ~30% of layers, an alignment-data scan finds zero fragmented prompts among 30,000 examples, and targeted-mutation experiments isolate safety words as the disruption locus.
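The gap-vs-behavior ROC-AUC figures quoted above can be read as the probability that a randomly chosen refused prompt has a larger gap than a randomly chosen complied prompt. The sketch below computes that quantity from pairwise comparisons. It assumes the gap is a single scalar per prompt (Figure 6 calls it a logit gap) and that behavior is a binary refused/complied label; the function name and the toy numbers are hypothetical and are not data from the paper.

```python
from typing import Sequence


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """ROC-AUC via pairwise comparison (equivalent to the Mann-Whitney U statistic).

    labels: 1 = model refused, 0 = model complied (assumed encoding).
    scores: per-prompt gap values; larger is assumed to mean "more likely to refuse".
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one refused and one complied example")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0   # refused prompt correctly ranked above complied prompt
            elif p == n:
                wins += 0.5   # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical toy data: five prompts, their gaps, and whether the model refused.
TOY_GAPS = [2.0, 1.0, 0.5, -0.5, -1.0]
TOY_REFUSED = [1, 1, 0, 1, 0]

if __name__ == "__main__":
    print(round(roc_auc(TOY_GAPS, TOY_REFUSED), 3))
```

On this reading, a value near 0.5 would mean the gap carries little information about behavior, while the reported per-model range of 0.66-0.98 indicates that the strength of the relationship varies considerably across families.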

What carries the argument

BPE tokenization of safety-critical words, which fragments them into sub-word pieces that the surveyed alignment datasets do not contain.
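A toy illustration of the fragmentation mechanism may help here. The sketch below is a minimal merge-table BPE encoder, not the tokenizer of any tested model: the merge table `TOY_MERGES`, the helper names, and the example word are all assumptions chosen for illustration. Real byte-level tokenizers also encode leading spaces and use far larger merge tables, which this sketch ignores.

```python
from typing import List, Tuple

# Hypothetical merge table, in priority order, that builds "harmful" as one token.
TOY_MERGES: List[Tuple[str, str]] = [
    ("h", "a"), ("ha", "r"), ("har", "m"),
    ("f", "u"), ("fu", "l"),
    ("harm", "ful"),
]


def bpe_encode(chunk: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Encode one whitespace-free chunk by applying each merge rule in order."""
    symbols = list(chunk)
    for left, right in merges:
        merged: List[str] = []
        i = 0
        while i < len(symbols):
            if i + 1 < len(symbols) and symbols[i] == left and symbols[i + 1] == right:
                merged.append(left + right)  # fuse the adjacent pair
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return symbols


def tokenize(text: str, merges: List[Tuple[str, str]]) -> List[str]:
    """Split on whitespace first (a simplified pre-tokenizer), then encode each chunk."""
    tokens: List[str] = []
    for chunk in text.split():
        tokens.extend(bpe_encode(chunk, merges))
    return tokens


if __name__ == "__main__":
    print(tokenize("harmful", TOY_MERGES))   # intact word
    print(tokenize("har mful", TOY_MERGES))  # same letters, one inserted space
```

With this table the intact word maps to a single token, while one inserted space yields three shorter pieces. Because merges never cross the whitespace boundary, the perturbed text stays readable to a human but reaches the model as a different token sequence.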

Load-bearing premise

The absence of intentionally fragmented safety inputs in the three public alignment datasets is the load-bearing cause of the bypass rather than other unmeasured properties of the models or training.

What would settle it

Training any of the tested models on a dataset that includes fragmented versions of the same safety words at attack-relevant intensities and then re-running the optimization to check whether the first-token refusal flip rate drops below 20 percent.

Figures

Figures reproduced from arXiv: 2607.01239 by Hongliang Liu, Tung-Ling Li, Yuhao Wu.

Figure 1: The mechanistic chain underlying character-level jailbreaks. Each link is established by an …
Figure 2: Space insertion progressively collapses the safety gap. Compound effects produce acceler…
Figure 3: Per-layer gap contribution on Qwen-3-4B. Most layers push toward compliance (red); 4 …
Figure 4: Safety-word mutation (red) vs. neutral-word mutation (blue) ASR as a function of BoN …
Figure 5: Fragmented safety-word comprehension across 5 model families (20 words each), refer…
Figure 6: Behavioral refusal rate vs. logit gap (400 evaluations, 4 models). The monotonic relationship …
Figure 7: ASR_loose vs. XSTest over-refusal across all canonical checkpoints. Star = baseline; circle = DPO; triangle = SFT. Green band = OR-cap eligibility (baseline OR + 5pp). DPO clusters inside the OR cap below baseline ASR; SFT achieves dramatic ASR reduction only by exiting the OR cap (Mistral, Gemma).
Figure 8: Conv-Benign change vs. Conv-Harm change for all canonical SFT (red triangles) and DPO …
Original abstract

Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public alignment datasets we surveyed contain no intentionally fragmented inputs. The mechanism is a chain, tested end-to-end on five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B). An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the disrupted signal to the last ~30% of layers; an alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall ≥99% at attack-relevant intensities); and targeted-mutation experiments isolate safety words as the disruption locus. On the defense side, a 68-cell grid (55 trained checkpoints) shows that no DPO configuration achieves seed- and pool-stable ASR closure on the three families with closed pool-size confounds. SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well, indicating the missing distribution is necessary but not sufficient under the LoRA-16 recipe we tested. To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic. All ASR claims are 3-judge-calibrated (cell rankings stable across judges; absolute levels ±18pp; see App. B.13).
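The alignment-data scan and its positive-control recall can be pictured with a small sketch. The abstract does not describe the detector itself, so everything below is a hypothetical heuristic rather than the authors' method: the watch list, the separator set, and the function names are assumptions. The idea is to flag a prompt when a watch-list word appears only after separators are removed, and to estimate recall by running the same detector on prompts that were fragmented on purpose.

```python
import re
from typing import Iterable, Sequence

# Hypothetical watch list; a real scan would need a much larger vocabulary.
WATCHLIST: Sequence[str] = ("harmful", "dangerous", "illegal")

# Hypothetical clean prompts used to build positive controls.
CLEAN = [
    "Is this chemical dangerous to store?",
    "Explain why the plan is illegal.",
    "List harmful side effects.",
]


def is_fragmented(prompt: str, watchlist: Sequence[str] = WATCHLIST) -> bool:
    """Flag a prompt if a watch-list word shows up only once separators are removed.

    Heuristic limits: adjacent ordinary words can collide with a watch-list word,
    and a prompt holding both an intact and a split copy of a word is not flagged.
    """
    lowered = prompt.lower()
    collapsed = re.sub(r"[\s\-_.]+", "", lowered)  # drop spaces, hyphens, underscores, dots
    return any(w in collapsed and w not in lowered for w in watchlist)


def fragment(prompt: str, watchlist: Sequence[str] = WATCHLIST) -> str:
    """Build a positive control by inserting one space in the middle of each watch-list word."""
    out = prompt
    for w in watchlist:
        mid = len(w) // 2
        out = re.sub(re.escape(w), w[:mid] + " " + w[mid:], out, flags=re.IGNORECASE)
    return out


def count_fragmented(corpus: Iterable[str]) -> int:
    """Number of prompts in the corpus that the detector flags."""
    return sum(is_fragmented(p) for p in corpus)


def positive_control_recall(clean_prompts: Sequence[str]) -> float:
    """Fraction of deliberately fragmented prompts that the detector catches."""
    controls = [fragment(p) for p in clean_prompts if fragment(p) != p]
    if not controls:
        raise ValueError("no prompt contained a watch-list word")
    return count_fragmented(controls) / len(controls)


if __name__ == "__main__":
    print(count_fragmented(CLEAN), positive_control_recall(CLEAN))
```

The positive control matters because a zero count is only informative if the detector would have caught fragmented prompts had they been present, which is the role the reported ≥99% recall plays in the abstract.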

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

3 major / 1 minor

Summary. The paper claims that BPE tokenization fragments safety-critical words into subword pieces and that the three surveyed public alignment datasets contain no intentionally fragmented inputs (zero in 30k examples, positive-control recall ≥99%), creating exploitable gaps in LLM safety. An optimization targeting this fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts across five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B), with 48% of flips yielding harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the effect to the last ~30% of layers; targeted mutations isolate safety words; a 68-cell DPO grid (55 checkpoints) finds no seed- and pool-stable ASR closure; SFT on fragmented prompts closes ASR on 3/5 families but only via global collapse that also raises benign refusal rates. All ASR claims use 3-judge calibration.

Significance. If the central mechanism holds, the work identifies a structural tokenization-alignment interaction that explains character-level bypasses and shows why standard alignment data is insufficient, with concrete localization via patching and a diagnostic (Conv-Benign) for distinguishing selective repair from collapse. The multi-family replication, dataset scan, and defense grid provide falsifiable empirical anchors that could guide more robust alignment recipes.

major comments (3)
  1. [Abstract (SFT and defense grid description)] Abstract / SFT paragraph and defense experiments: the claim that dataset absence is the load-bearing cause of the bypass is undercut by the SFT results, which close ASR on only 3/5 families and do so only through global collapse that raises benign refusal rates; this outcome is consistent with deeper BPE-alignment interactions (e.g., safety-signal representation in the final ~30% of layers) rather than training-distribution composition alone, and the targeted-mutation and patching results do not rule out such alternatives.
  2. [Abstract (DPO grid)] Abstract / 68-cell grid description: the statement that 'no DPO configuration achieves seed- and pool-stable ASR closure' rests on a 68-cell search whose pre-specification versus post-hoc selection is not detailed; without that, the robustness of the negative DPO result to post-hoc selection within the grid cannot be assessed.
  3. [Abstract (optimization results)] Abstract / attack optimization: the 80-100% flip rate and 48% harmful-output rate are reported on 'refused HarmBench prompts' after optimization, but the manuscript does not specify whether the optimization hyperparameters were tuned on a held-out subset or the full test pool, leaving open the possibility that reported success rates partly reflect overfitting rather than a general token-boundary vulnerability.
minor comments (1)
  1. [Abstract] The ±18pp absolute-level variability across the three judges is noted only in App. B.13; a brief main-text summary of judge agreement on the key 80-100% and 48% figures would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and proposed revisions where appropriate.

point-by-point responses
  1. Referee: [Abstract (SFT and defense grid description)] Abstract / SFT paragraph and defense experiments: the claim that dataset absence is the load-bearing cause of the bypass is undercut by the SFT results, which close ASR on only 3/5 families and do so only through global collapse that raises benign refusal rates; this outcome is consistent with deeper BPE-alignment interactions (e.g., safety-signal representation in the final ~30% of layers) rather than training-distribution composition alone, and the targeted-mutation and patching results do not rule out such alternatives.

    Authors: The abstract already states that the missing distribution is 'necessary but not sufficient under the LoRA-16 recipe we tested' and introduces Conv-Benign precisely to distinguish selective repair from global collapse. The SFT results are presented as evidence that adding fragmented examples alone does not yield stable selective alignment under the tested recipe. While we agree that deeper interactions in later layers (as localized by patching) may contribute, the targeted-mutation experiments isolate safety-word fragmentation as a causal locus for the first-token trigger flip. We will revise the abstract and discussion to more explicitly frame the dataset gap as one structural factor among potentially interacting mechanisms, without claiming it is the sole load-bearing cause. revision: partial

  2. Referee: [Abstract (DPO grid)] Abstract / 68-cell grid description: the statement that 'no DPO configuration achieves seed- and pool-stable ASR closure' rests on a 68-cell search whose pre-specification versus post-hoc selection is not detailed; without that, the robustness of the negative DPO result to post-hoc selection within the grid cannot be assessed.

    Authors: The grid was constructed from standard DPO hyperparameter ranges in the literature (learning rate, beta, epochs) crossed with the three pool sizes that close the pool-size confound, with all 68 cells evaluated exhaustively. We agree that explicit documentation of the pre-specification process is needed to demonstrate absence of post-hoc selection. We will add an appendix subsection detailing the grid design rationale, the full list of evaluated combinations, and confirmation that no cells were excluded after initial runs. revision: yes

  3. Referee: [Abstract (optimization results)] Abstract / attack optimization: the 80-100% flip rate and 48% harmful-output rate are reported on 'refused HarmBench prompts' after optimization, but the manuscript does not specify whether the optimization hyperparameters were tuned on a held-out subset or the full test pool, leaving open the possibility that reported success rates partly reflect overfitting rather than a general token-boundary vulnerability.

    Authors: Hyperparameters for the attack optimization were selected via a small development set of 50 prompts drawn from a separate split of HarmBench, with the reported results evaluated only on the remaining held-out test prompts. We will update the methods section and abstract to explicitly state this held-out tuning procedure and report the development-set size to address potential overfitting concerns. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical chain is self-contained

full rationale

The paper presents an end-to-end empirical pipeline: dataset scan for fragmented prompts (30k examples, positive-control recall ≥99%), optimization targeting safety-token fragmentation, first-token refusal flips on HarmBench, activation patching localizing to last ~30% layers, targeted-mutation isolation, and a 68-cell defense grid with SFT/DPO checkpoints. No equations, fitted parameters renamed as predictions, or self-citation chains appear; the absence of fragmented inputs is measured directly rather than assumed by definition, and the SFT result is reported as showing necessity but not sufficiency. All claims are grounded in observable experimental outcomes rather than reducing to their own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Central claim rests on the empirical premise that standard alignment datasets omit fragmented safety tokens and that this omission directly causes the bypass; attack optimization and defense grid introduce unspecified hyperparameters whose selection is not detailed.

free parameters (2)
  • attack optimization hyperparameters
    Parameters controlling the character perturbations and safety-token targeting are not specified.
  • DPO grid search parameters
    The 68-cell configuration space and LoRA-16 recipe contain multiple tunable values whose selection affects the reported failure to close ASR.
axioms (2)
  • domain assumption: BPE is the tokenization method used by all five tested model families and creates the relevant sub-word fragments.
    Invoked to link character perturbations to token fragmentation.
  • domain assumption: The surveyed alignment datasets are representative of typical safety training distributions.
    Used to conclude that fragmented inputs are systematically missing.

pith-pipeline@v0.9.1-grok · 5883 in / 1452 out tokens · 38762 ms · 2026-07-04T01:15:02.947803+00:00 · methodology

