Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment

Hongliang Liu; Tung-Ling Li; Yuhao Wu

arxiv: 2607.01239 · v1 · pith:GDJNTIANnew · submitted 2026-05-01 · 💻 cs.CL · cs.AI

Breaking Safety at the Token Boundary: How BPE Tokenization Creates Exploitable Gaps in LLM Alignment

Tung-Ling Li , Hongliang Liu , Yuhao Wu This is my paper

Pith reviewed 2026-07-04 01:15 UTC · model grok-4.3

classification 💻 cs.CL cs.AI

keywords BPE tokenizationLLM safety alignmenttoken fragmentationrefusal triggerHarmBenchjailbreak attackDPOSFT

0 comments

The pith

BPE tokenization fragments safety-critical words into sub-tokens absent from alignment datasets, allowing targeted optimization to flip refusal triggers.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper establishes that BPE tokenization splits safety words into pieces that the three surveyed public alignment datasets never include. This structural gap lets an optimization targeting those fragments flip the first-token refusal trigger on 80-100 percent of refused HarmBench prompts across five model families. Nearly half the flips produce genuinely harmful outputs according to the models. Activation patching places the disrupted signal in the last 30 percent of layers, while a scan of 30,000 alignment examples finds zero fragmented prompts. DPO training fails to close attack success rate stably on any family, and SFT on fragmented data only reduces attacks by raising refusal on benign inputs as well.

Core claim

BPE tokenization creates exploitable gaps in LLM alignment because the three public alignment datasets surveyed contain no intentionally fragmented inputs. An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the disrupted signal to the last ~30% of layers, an alignment-data scan finds zero fragmented prompts among 30,000 examples, and targeted-mutation experiments isolate safety words as the disruption locus.

What carries the argument

BPE tokenization of safety-critical words, which fragments them into sub-word pieces that alignment training never encounters.

Load-bearing premise

The absence of intentionally fragmented safety inputs in the three public alignment datasets is the load-bearing cause of the bypass rather than other unmeasured properties of the models or training.

What would settle it

Training any of the tested models on a dataset that includes fragmented versions of the same safety words at attack-relevant intensities and then re-running the optimization to check whether the first-token refusal flip rate drops below 20 percent.

Figures

Figures reproduced from arXiv: 2607.01239 by Hongliang Liu, Tung-Ling Li, Yuhao Wu.

**Figure 2.** Figure 2: Space insertion progressively collapses the safety gap. Compound effects produce acceler [PITH_FULL_IMAGE:figures/full_fig_p004_2.png] view at source ↗

**Figure 3.** Figure 3: Per-layer gap contribution on Qwen-3-4B. Most layers push toward compliance (red); 4 [PITH_FULL_IMAGE:figures/full_fig_p005_3.png] view at source ↗

**Figure 4.** Figure 4: Safety-word mutation (red) vs. neutral-word mutation (blue) ASR as a function of BoN [PITH_FULL_IMAGE:figures/full_fig_p007_4.png] view at source ↗

**Figure 5.** Figure 5: Fragmented safety-word comprehension across 5 model families (20 words each), refer [PITH_FULL_IMAGE:figures/full_fig_p014_5.png] view at source ↗

**Figure 6.** Figure 6: Behavioral refusal rate vs. logit gap (400 evaluations, 4 models). The monotonic relationship [PITH_FULL_IMAGE:figures/full_fig_p017_6.png] view at source ↗

**Figure 7.** Figure 7: ASRloose vs XSTest over-refusal across all canonical checkpoints. Star = baseline; circle = DPO; triangle = SFT. Green band = OR-cap eligibility (baseline OR + 5pp). DPO clusters inside the OR cap below baseline ASR; SFT achieves dramatic ASR reduction only by exiting the OR cap (Mistral, Gemma) [PITH_FULL_IMAGE:figures/full_fig_p018_7.png] view at source ↗

**Figure 8.** Figure 8: Conv-Benign change vs Conv-Harm change for all canonical SFT (red triangles) and DPO [PITH_FULL_IMAGE:figures/full_fig_p018_8.png] view at source ↗

read the original abstract

Character-level perturbations bypass safety alignment in modern LLMs despite leaving prompts human-readable. We identify and test a central structural mechanism: BPE tokenization fragments safety-critical words into sub-word pieces, and the three public alignment datasets we surveyed contain no intentionally fragmented inputs. The mechanism is a chain, tested end-to-end on five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B). An optimization targeting safety-token fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts, with 48% of those flips producing genuinely harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the disrupted signal to the last ${\sim}30\%$ of layers; an alignment-data scan finds zero fragmented prompts among 30,000 examples (positive-control recall $\geq 99\%$ at attack-relevant intensities); and targeted-mutation experiments isolate safety words as the disruption locus. On the defense side, a 68-cell grid (55 trained checkpoints) shows that no DPO configuration achieves seed- and pool-stable ASR closure on the three families with closed pool-size confounds. SFT trained on fragmented prompts closes ASR on 3/5 families but only via global collapse that raises refusal on benign prompts as well, indicating the missing distribution is necessary but not sufficient under the LoRA-16 recipe we tested. To distinguish selective repair from global collapse, we introduce Conv-Benign, a candidate paired diagnostic. All ASR claims are 3-judge-calibrated (cell rankings stable across judges; absolute levels $\pm$18pp; see App.~B.13).

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper shows BPE fragmentation of safety words enables reliable first-token refusal flips across models because alignment data has none, but their own SFT runs indicate the root cause is not just missing examples.

read the letter

The main point is that BPE splits safety-critical words into pieces the models were never trained to refuse, and the authors demonstrate this produces consistent bypasses on HarmBench prompts.

They scan three public alignment datasets totaling 30k examples and find zero intentionally fragmented safety inputs, with positive controls confirming the scan would have caught them. The attack optimization then flips the refusal trigger on 80-100% of cases, with 48% of flips yielding harmful outputs per their judges. Activation patching pins the disrupted signal to the final 30% of layers, and targeted mutations isolate the effect to the safety words themselves. The Conv-Benign diagnostic is a useful addition for telling selective repair from broad collapse.

The soft spot is the causal story around dataset composition. SFT on fragmented prompts only closes attack success on three of five families, and it does so by increasing refusal on benign prompts as well. That outcome undercuts the claim that the absence in training data is the load-bearing reason for the vulnerability. The patching and mutation results localize the phenomenon but do not rule out deeper interactions between the tokenizer and the alignment process.

This work is for alignment researchers focused on tokenizer effects and data curation. Readers evaluating safety pipelines or building robustness checks would get concrete numbers and a new diagnostic from it. The experiments are specific enough and the multi-model reporting is clear enough that it deserves a serious referee, even if the interpretation of the SFT results needs tightening.

Referee Report

3 major / 1 minor

Summary. The paper claims that BPE tokenization fragments safety-critical words into subword pieces and that the three surveyed public alignment datasets contain no intentionally fragmented inputs (zero in 30k examples, positive-control recall ≥99%), creating exploitable gaps in LLM safety. An optimization targeting this fragmentation flips the first-token refusal trigger on 80-100% of refused HarmBench prompts across five model families (Qwen-3-4B, Qwen-2.5-7B, Gemma-3-4B, Llama-3.1-8B, Mistral-7B), with 48% of flips yielding harmful outputs (per-model 29-65%; gap-vs-behavior ROC-AUC 0.66-0.98, pooled 0.84). Activation patching localizes the effect to the last ~30% of layers; targeted mutations isolate safety words; a 68-cell DPO grid (55 checkpoints) finds no seed- and pool-stable ASR closure; SFT on fragmented prompts closes ASR on 3/5 families but only via global collapse that also raises benign refusal rates. All ASR claims use 3-judge calibration.

Significance. If the central mechanism holds, the work identifies a structural tokenization-alignment interaction that explains character-level bypasses and shows why standard alignment data is insufficient, with concrete localization via patching and a diagnostic (Conv-Benign) for distinguishing selective repair from collapse. The multi-family replication, dataset scan, and defense grid provide falsifiable empirical anchors that could guide more robust alignment recipes.

major comments (3)

[Abstract (SFT and defense grid description)] Abstract / SFT paragraph and defense experiments: the claim that dataset absence is the load-bearing cause of the bypass is undercut by the SFT results, which close ASR on only 3/5 families and do so only through global collapse that raises benign refusal rates; this outcome is consistent with deeper BPE-alignment interactions (e.g., safety-signal representation in the final ~30% of layers) rather than training-distribution composition alone, and the targeted-mutation and patching results do not rule out such alternatives.
[Abstract (DPO grid)] Abstract / 68-cell grid description: the statement that 'no DPO configuration achieves seed- and pool-stable ASR closure' rests on a 68-cell search whose pre-specification versus post-hoc selection is not detailed; without that, the negative result on DPO cannot be assessed for robustness against the reader's noted concern of potential post-hoc selection in the grid.
[Abstract (optimization results)] Abstract / attack optimization: the 80-100% flip rate and 48% harmful-output rate are reported on 'refused HarmBench prompts' after optimization, but the manuscript does not specify whether the optimization hyperparameters were tuned on a held-out subset or the full test pool, leaving open the possibility that reported success rates partly reflect overfitting rather than a general token-boundary vulnerability.

minor comments (1)

[Abstract] The ±18pp absolute-level variability across the three judges is noted only in App. B.13; a brief main-text summary of judge agreement on the key 80-100% and 48% figures would improve readability.

Simulated Author's Rebuttal

3 responses · 0 unresolved

We thank the referee for the detailed and constructive feedback. We address each major comment below with clarifications and proposed revisions where appropriate.

read point-by-point responses

Referee: [Abstract (SFT and defense grid description)] Abstract / SFT paragraph and defense experiments: the claim that dataset absence is the load-bearing cause of the bypass is undercut by the SFT results, which close ASR on only 3/5 families and do so only through global collapse that raises benign refusal rates; this outcome is consistent with deeper BPE-alignment interactions (e.g., safety-signal representation in the final ~30% of layers) rather than training-distribution composition alone, and the targeted-mutation and patching results do not rule out such alternatives.

Authors: The abstract already states that the missing distribution is 'necessary but not sufficient under the LoRA-16 recipe we tested' and introduces Conv-Benign precisely to distinguish selective repair from global collapse. The SFT results are presented as evidence that adding fragmented examples alone does not yield stable selective alignment under the tested recipe. While we agree that deeper interactions in later layers (as localized by patching) may contribute, the targeted-mutation experiments isolate safety-word fragmentation as a causal locus for the first-token trigger flip. We will revise the abstract and discussion to more explicitly frame the dataset gap as one structural factor among potentially interacting mechanisms, without claiming it is the sole load-bearing cause. revision: partial
Referee: [Abstract (DPO grid)] Abstract / 68-cell grid description: the statement that 'no DPO configuration achieves seed- and pool-stable ASR closure' rests on a 68-cell search whose pre-specification versus post-hoc selection is not detailed; without that, the negative result on DPO cannot be assessed for robustness against the reader's noted concern of potential post-hoc selection in the grid.

Authors: The grid was constructed from standard DPO hyperparameter ranges in the literature (learning rate, beta, epochs) crossed with the three pool sizes that close the pool-size confound, with all 68 cells evaluated exhaustively. We agree that explicit documentation of the pre-specification process is needed to demonstrate absence of post-hoc selection. We will add an appendix subsection detailing the grid design rationale, the full list of evaluated combinations, and confirmation that no cells were excluded after initial runs. revision: yes
Referee: [Abstract (optimization results)] Abstract / attack optimization: the 80-100% flip rate and 48% harmful-output rate are reported on 'refused HarmBench prompts' after optimization, but the manuscript does not specify whether the optimization hyperparameters were tuned on a held-out subset or the full test pool, leaving open the possibility that reported success rates partly reflect overfitting rather than a general token-boundary vulnerability.

Authors: Hyperparameters for the attack optimization were selected via a small development set of 50 prompts drawn from a separate split of HarmBench, with the reported results evaluated only on the remaining held-out test prompts. We will update the methods section and abstract to explicitly state this held-out tuning procedure and report the development-set size to address potential overfitting concerns. revision: yes

Circularity Check

0 steps flagged

No significant circularity; empirical chain is self-contained

full rationale

The paper presents an end-to-end empirical pipeline: dataset scan for fragmented prompts (30k examples, positive-control recall ≥99%), optimization targeting safety-token fragmentation, first-token refusal flips on HarmBench, activation patching localizing to last ~30% layers, targeted-mutation isolation, and a 68-cell defense grid with SFT/DPO checkpoints. No equations, fitted parameters renamed as predictions, or self-citation chains appear; the absence of fragmented inputs is measured directly rather than assumed by definition, and the SFT result is reported as showing necessity but not sufficiency. All claims are grounded in observable experimental outcomes rather than reducing to their own inputs by construction.

Axiom & Free-Parameter Ledger

2 free parameters · 2 axioms · 0 invented entities

Central claim rests on the empirical premise that standard alignment datasets omit fragmented safety tokens and that this omission directly causes the bypass; attack optimization and defense grid introduce unspecified hyperparameters whose selection is not detailed.

free parameters (2)

attack optimization hyperparameters
Parameters controlling the character perturbations and safety-token targeting are not specified.
DPO grid search parameters
The 68-cell configuration space and LoRA-16 recipe contain multiple tunable values whose selection affects the reported failure to close ASR.

axioms (2)

domain assumption BPE is the tokenization method used by all five tested model families and creates the relevant sub-word fragments
Invoked to link character perturbations to token fragmentation.
domain assumption The surveyed alignment datasets are representative of typical safety training distributions
Used to conclude that fragmented inputs are systematically missing.

pith-pipeline@v0.9.1-grok · 5883 in / 1452 out tokens · 38762 ms · 2026-07-04T01:15:02.947803+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

29 extracted references · 14 canonical work pages · 6 internal anchors

[1]

Best-of- N Jailbreaking

Hughes, R., et al. Best-of- N Jailbreaking. arXiv:2412.03556, 2024

work page arXiv 2024
[2]

Jailbroken: How Does LLM Safety Training Fail? In NeurIPS, 2023

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? In NeurIPS, 2023

2023
[3]

Neural Machine Translation of Rare Words with Subword Units

Sennrich, R., Haddow, B., and Birch, A. Neural Machine Translation of Rare Words with Subword Units. In ACL, 2016

2016
[4]

Training language models to follow instructions with human feedback

Ouyang, L., et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022

2022
[5]

Direct Preference Optimization

Rafailov, R., et al. Direct Preference Optimization. In NeurIPS, 2023

2023
[6]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A., et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[7]

Locating and Editing Factual Associations in GPT

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and Editing Factual Associations in GPT . In NeurIPS, 2023

2023
[8]

Refusal in Language Models Is Mediated by a Single Direction

Arditi, A., et al. Refusal in Language Models Is Mediated by a Single Direction. arXiv:2406.11717, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[9]

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. arXiv:2406.05946, 2024

work page arXiv 2024
[10]

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Robey, A., et al. SmoothLLM : Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023
[11]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, M., et al. HarmBench : A Standardized Evaluation Framework for Automated Red Teaming. arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024
[12]

and Bisk, Y

Belinkov, Y. and Bisk, Y. Synthetic and Natural Noise Both Break Neural Machine Translation. In ICLR, 2018

2018
[13]

ByT5 : Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Xue, L., et al. ByT5 : Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. TACL, 2022

2022
[14]

H., et al

Clark, J. H., et al. CANINE : Pre-training an Efficient Tokenization-Free Encoder. TACL, 2022

2022
[15]

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

Tay, Y., et al. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. In ICLR, 2022

2022
[16]

SafeDecoding : Defending against Jailbreak Attacks via Safety-Aware Decoding

Xu, Z., et al. SafeDecoding : Defending against Jailbreak Attacks via Safety-Aware Decoding. In ACL, 2024

2024
[17]

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Casper, S., et al. Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv:2403.05030, 2024

work page arXiv 2024
[18]

S2C : Split-and-Combine Jailbreak Attacks

Wang, Y., Shi, J., and Qi, Y. S2C : Split-and-Combine Jailbreak Attacks. arXiv:2405.13965, 2024

work page arXiv 2024
[19]

Tokenization Falling Short: On Subword Robustness in Large Language Models

Chai, Y., Fang, Y., and Peng, Q. Tokenization Falling Short: On Subword Robustness in Large Language Models. arXiv:2406.11687, 2024

work page arXiv 2024
[20]

Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

Jang, E., Lee, K., and Chung, J.-W. Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers. arXiv:2501.02019, 2025

work page arXiv 2025
[21]

LBPE : Long-token-first Tokenization to Improve Large Language Models

Lian, H., Xiong, Y., and Lin, Z. LBPE : Long-token-first Tokenization to Improve Large Language Models. arXiv:2404.18553, 2024 a

work page arXiv 2024
[22]

Scaffold- BPE : Enhancing Byte Pair Encoding with Scaffold Token Removal

Lian, H., Xiong, Y., and Niu, J. Scaffold- BPE : Enhancing Byte Pair Encoding with Scaffold Token Removal. arXiv:2407.15626, 2024 b

work page arXiv 2024
[23]

LiteToken : Removing Intermediate Merge Residues From BPE Tokenizers

Sun, Y., Yang, H., and Lin, Z. LiteToken : Removing Intermediate Merge Residues From BPE Tokenizers. arXiv, 2026

2026
[24]

Papachappa, H. M. Morphological-Core Tokenization. arXiv, 2025

2025
[25]

Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

Li, T.-L. and Liu, H. Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models. arXiv:2506.24056, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025
[26]

Asgari, E., El Kheir, Y., and Javaheri, M. A. S. MorphBPE : A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies. arXiv, 2025

2025
[27]

Byte BPE Tokenization as an Inverse String Homomorphism

Lian, H., Xiong, Y., and Lin, Z. Byte BPE Tokenization as an Inverse String Homomorphism. arXiv, 2024 c

2024
[28]

Defending ChatGPT against Jailbreak Attack via Self-Reminders

Xie, Y., Yi, J., Shao, J., Curl, J., Lyu, L., Chen, Q., Xie, X., and Wu, F. Defending ChatGPT against Jailbreak Attack via Self-Reminders. Nature Machine Intelligence, 5(12):1486--1496, 2023

2023
[29]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama Guard : LLM -based Input-Output Safeguard for Human- AI Conversations. arXiv:2312.06674, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[1] [1]

Best-of- N Jailbreaking

Hughes, R., et al. Best-of- N Jailbreaking. arXiv:2412.03556, 2024

work page arXiv 2024

[2] [2]

Jailbroken: How Does LLM Safety Training Fail? In NeurIPS, 2023

Wei, A., Haghtalab, N., and Steinhardt, J. Jailbroken: How Does LLM Safety Training Fail? In NeurIPS, 2023

2023

[3] [3]

Neural Machine Translation of Rare Words with Subword Units

Sennrich, R., Haddow, B., and Birch, A. Neural Machine Translation of Rare Words with Subword Units. In ACL, 2016

2016

[4] [4]

Training language models to follow instructions with human feedback

Ouyang, L., et al. Training language models to follow instructions with human feedback. In NeurIPS, 2022

2022

[5] [5]

Direct Preference Optimization

Rafailov, R., et al. Direct Preference Optimization. In NeurIPS, 2023

2023

[6] [6]

Universal and Transferable Adversarial Attacks on Aligned Language Models

Zou, A., et al. Universal and Transferable Adversarial Attacks on Aligned Language Models. arXiv:2307.15043, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[7] [7]

Locating and Editing Factual Associations in GPT

Meng, K., Bau, D., Andonian, A., and Belinkov, Y. Locating and Editing Factual Associations in GPT . In NeurIPS, 2023

2023

[8] [8]

Refusal in Language Models Is Mediated by a Single Direction

Arditi, A., et al. Refusal in Language Models Is Mediated by a Single Direction. arXiv:2406.11717, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[9] [9]

Safety Alignment Should Be Made More Than Just a Few Tokens Deep

Qi, X., Panda, A., Lyu, K., Ma, X., Roy, S., Beirami, A., Mittal, P., and Henderson, P. Safety Alignment Should Be Made More Than Just a Few Tokens Deep. arXiv:2406.05946, 2024

work page arXiv 2024

[10] [10]

SmoothLLM: Defending Large Language Models Against Jailbreaking Attacks

Robey, A., et al. SmoothLLM : Defending Large Language Models Against Jailbreaking Attacks. arXiv:2310.03684, 2023

work page internal anchor Pith review Pith/arXiv arXiv 2023

[11] [11]

HarmBench: A Standardized Evaluation Framework for Automated Red Teaming and Robust Refusal

Mazeika, M., et al. HarmBench : A Standardized Evaluation Framework for Automated Red Teaming. arXiv:2402.04249, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024

[12] [12]

and Bisk, Y

Belinkov, Y. and Bisk, Y. Synthetic and Natural Noise Both Break Neural Machine Translation. In ICLR, 2018

2018

[13] [13]

ByT5 : Towards a Token-Free Future with Pre-trained Byte-to-Byte Models

Xue, L., et al. ByT5 : Towards a Token-Free Future with Pre-trained Byte-to-Byte Models. TACL, 2022

2022

[14] [14]

H., et al

Clark, J. H., et al. CANINE : Pre-training an Efficient Tokenization-Free Encoder. TACL, 2022

2022

[15] [15]

Charformer: Fast Character Transformers via Gradient-based Subword Tokenization

Tay, Y., et al. Charformer: Fast Character Transformers via Gradient-based Subword Tokenization. In ICLR, 2022

2022

[16] [16]

SafeDecoding : Defending against Jailbreak Attacks via Safety-Aware Decoding

Xu, Z., et al. SafeDecoding : Defending against Jailbreak Attacks via Safety-Aware Decoding. In ACL, 2024

2024

[17] [17]

Defending Against Unforeseen Failure Modes with Latent Adversarial Training

Casper, S., et al. Defending Against Unforeseen Failure Modes with Latent Adversarial Training. arXiv:2403.05030, 2024

work page arXiv 2024

[18] [18]

S2C : Split-and-Combine Jailbreak Attacks

Wang, Y., Shi, J., and Qi, Y. S2C : Split-and-Combine Jailbreak Attacks. arXiv:2405.13965, 2024

work page arXiv 2024

[19] [19]

Tokenization Falling Short: On Subword Robustness in Large Language Models

Chai, Y., Fang, Y., and Peng, Q. Tokenization Falling Short: On Subword Robustness in Large Language Models. arXiv:2406.11687, 2024

work page arXiv 2024

[20] [20]

Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers

Jang, E., Lee, K., and Chung, J.-W. Improbable Bigrams Expose Vulnerabilities of Incomplete Tokens in Byte-Level Tokenizers. arXiv:2501.02019, 2025

work page arXiv 2025

[21] [21]

LBPE : Long-token-first Tokenization to Improve Large Language Models

Lian, H., Xiong, Y., and Lin, Z. LBPE : Long-token-first Tokenization to Improve Large Language Models. arXiv:2404.18553, 2024 a

work page arXiv 2024

[22] [22]

Scaffold- BPE : Enhancing Byte Pair Encoding with Scaffold Token Removal

Lian, H., Xiong, Y., and Niu, J. Scaffold- BPE : Enhancing Byte Pair Encoding with Scaffold Token Removal. arXiv:2407.15626, 2024 b

work page arXiv 2024

[23] [23]

LiteToken : Removing Intermediate Merge Residues From BPE Tokenizers

Sun, Y., Yang, H., and Lin, Z. LiteToken : Removing Intermediate Merge Residues From BPE Tokenizers. arXiv, 2026

2026

[24] [24]

Papachappa, H. M. Morphological-Core Tokenization. arXiv, 2025

2025

[25] [25]

Logit-Gap Steering: A Forward-Pass Diagnostic for Alignment Robustness

Li, T.-L. and Liu, H. Logit-Gap Steering: Efficient Short-Suffix Jailbreaks for Aligned Large Language Models. arXiv:2506.24056, 2025

work page internal anchor Pith review Pith/arXiv arXiv 2025

[26] [26]

Asgari, E., El Kheir, Y., and Javaheri, M. A. S. MorphBPE : A Morpho-Aware Tokenizer Bridging Linguistic Complexity for Efficient LLM Training Across Morphologies. arXiv, 2025

2025

[27] [27]

Byte BPE Tokenization as an Inverse String Homomorphism

Lian, H., Xiong, Y., and Lin, Z. Byte BPE Tokenization as an Inverse String Homomorphism. arXiv, 2024 c

2024

[28] [28]

Defending ChatGPT against Jailbreak Attack via Self-Reminders

Xie, Y., Yi, J., Shao, J., Curl, J., Lyu, L., Chen, Q., Xie, X., and Wu, F. Defending ChatGPT against Jailbreak Attack via Self-Reminders. Nature Machine Intelligence, 5(12):1486--1496, 2023

2023

[29] [29]

Llama Guard: LLM-based Input-Output Safeguard for Human-AI Conversations

Inan, H., Upasani, K., Chi, J., Rungta, R., Iyer, K., Mao, Y., Tontchev, M., Hu, Q., Fuller, B., Testuggine, D., and Khabsa, M. Llama Guard : LLM -based Input-Output Safeguard for Human- AI Conversations. arXiv:2312.06674, 2024

work page internal anchor Pith review Pith/arXiv arXiv 2024