Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices

Iraklis Klampanos; Liu Zai

arxiv: 2601.05833 · v2 · submitted 2026-01-09 · 💻 cs.CL

Pith reviewed 2026-05-16 15:56 UTC · model grok-4.3

keywords: pretokenization · byte pair encoding · BPE · LLM · edge computing · tokenizer · regex optimization · inference optimization

The pith

A linear-scan pretokenizer replaces the regex in cl100k-style tokenizers, producing identical output at higher speed.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

This paper introduces Peek2 as a regex-free alternative to the pretokenization step in byte-level BPE tokenizers like those in GPT-3 and Llama-3. By analyzing the original cl100k logic, the authors created a linear-time algorithm with constant memory that serves as a direct replacement. Tests demonstrate speed improvements of up to 2.48 times in microbenchmarks and 1.14 times overall without altering the tokenized output. The work targets faster LLM inference on edge devices where regex operations are costly.

Core claim

Peek2 is a new pretokenization algorithm that replicates the behavior of the cl100k regex-based pretokenizer using a linear scan with linear time complexity and constant memory usage, achieving identical pretokenization results and increased throughput for byte-level BPE encoding.

What carries the argument

The Peek2 linear scan algorithm, which processes input bytes sequentially to identify pretoken boundaries without regular expressions.
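
To make the idea of a regex-free scan concrete, here is a minimal sketch in Python. It is not the authors' Peek2 algorithm and it does not use the real cl100k pattern: `TOY_PATTERN` is a simplified, ASCII-only pattern assumed purely for illustration, and `scan_pretokenize`, `kind`, and `run_end` are hypothetical names. The scanner classifies each character, peeks one character ahead, and emits the same pieces as the toy regex in a single left-to-right pass.

```python
import re
import string

# Toy ASCII-only pattern (an assumption for illustration, NOT the cl100k regex).
TOY_PATTERN = re.compile(
    r" ?[A-Za-z]+|[0-9]{1,3}| ?[^\sA-Za-z0-9]+|\s+(?!\S)|\s+", re.ASCII
)

LETTERS = frozenset(string.ascii_letters)
DIGITS = frozenset(string.digits)
SPACES = frozenset(" \t\n\r\f\v")  # what \s matches under re.ASCII


def kind(ch):
    """Classify a character as Letter, Number, Space, or Other."""
    if ch in LETTERS:
        return "L"
    if ch in DIGITS:
        return "N"
    if ch in SPACES:
        return "S"
    return "O"


def run_end(text, start, k, limit=None):
    """Index just past the run of class-k characters beginning at start."""
    stop = len(text) if limit is None else min(len(text), start + limit)
    end = start
    while end < stop and kind(text[end]) == k:
        end += 1
    return end


def scan_pretokenize(text):
    """Single-pass scan that mirrors TOY_PATTERN.findall(text)."""
    pieces = []
    i, n = 0, len(text)
    while i < n:
        k = kind(text[i])
        nxt = kind(text[i + 1]) if i + 1 < n else None  # one-character peek
        if k in ("L", "O"):
            j = run_end(text, i, k)              # letter run or symbol run
        elif k == "N":
            j = run_end(text, i, "N", limit=3)   # at most three digits
        elif text[i] == " " and nxt in ("L", "O"):
            j = run_end(text, i + 1, nxt)        # a space glued to the next run
        else:
            j = run_end(text, i, "S")            # whitespace run
            if j < n and j - i > 1:
                j -= 1  # mirrors \s+(?!\S): leave the last space for the next piece
        pieces.append(text[i:j])
        i = j
    return pieces


print(scan_pretokenize("Hello  world 123!"))
# ['Hello', ' ', ' world', ' ', '123', '!']
```

Every character is visited a bounded number of times and the only state is a pair of indices, which is the property the paper claims for Peek2. The real cl100k pattern additionally handles contractions, Unicode letter and number classes, and newline runs, none of which this sketch attempts.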

If this is right

Microbenchmark throughput increases by up to 2.48× depending on the dataset.
Overall throughput for the entire Byte-level BPE encoding process improves by 1.14×.
Memory usage remains constant and trivial, making it suitable for edge devices.
Results match the baseline regex tokenizer exactly, allowing drop-in replacement.
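
The two throughput figures can be related with a back-of-envelope Amdahl's-law calculation. The sketch below assumes, hypothetically, that both numbers were measured on the same workload and that pretokenization is the only stage that changed; the abstract says the figures depend on the dataset, so this may not hold. The function name `implied_fraction` is illustrative.

```python
def implied_fraction(stage_speedup, overall_speedup):
    """Fraction f of baseline time spent in the accelerated stage.

    Solves Amdahl's law, 1/overall = (1 - f) + f/stage, for f.
    """
    return (1.0 - 1.0 / overall_speedup) / (1.0 - 1.0 / stage_speedup)


# Assumption: 2.48x (stage) and 1.14x (end to end) describe the same run.
f = implied_fraction(2.48, 1.14)
print(f"{f:.3f}")  # 0.206
```

Under those assumptions, pretokenization would account for roughly a fifth of baseline encoding time, which also bounds the end-to-end gain any pretokenizer could deliver at about 1 / (1 - 0.206), or roughly 1.26×.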

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar linear-scan replacements could optimize other regex-dependent preprocessing steps in NLP pipelines.
This approach may enable real-time tokenization on very low-power hardware where regex libraries are unavailable or slow.
Future work could test Peek2 on additional datasets or integrate it into full tokenizer libraries for broader adoption.

Load-bearing premise

The authors' analysis of cl100k logic covers all possible input cases so that the linear scan always matches the regex output exactly.

What would settle it

Finding any input string where the pretokenized segments from Peek2 differ from those produced by the original cl100k regex pretokenizer would show the methods are not equivalent.
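
A search for such a counterexample can be automated with differential testing. The following sketch is generic and assumes nothing about the paper's test suite: `random_inputs` and `first_mismatch` are hypothetical helpers, and `reference` and `buggy_candidate` are stand-ins for the regex baseline and a candidate replacement such as Peek2.

```python
import random
import re


def random_inputs(count, alphabet, max_len, seed=0):
    """Yield reproducible random strings drawn from a small alphabet."""
    rng = random.Random(seed)
    for _ in range(count):
        length = rng.randint(0, max_len)
        yield "".join(rng.choice(alphabet) for _ in range(length))


def first_mismatch(reference, candidate, inputs):
    """Return the first input on which the two pretokenizers disagree, else None."""
    for text in inputs:
        if list(reference(text)) != list(candidate(text)):
            return text
    return None


# Stand-in pretokenizers (hypothetical; substitute the real baseline and candidate).
reference = re.compile(r"\s+|\S+").findall


def buggy_candidate(text):
    # Deliberately wrong: str.split() drops the whitespace pieces.
    return text.split()


witness = first_mismatch(reference, buggy_candidate, random_inputs(1000, "ab \n", 8))
print(repr(witness))
```

A small alphabet that over-represents whitespace, newlines, and punctuation tends to reach boundary cases much faster than natural text does. A run that returns `None` shows agreement only on the sampled inputs; it is not a proof of equivalence.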

Original abstract

Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers, yet little work has been done to optimize it for edge-side inference. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. After breaking down and analyzing the logic of the original cl100k pretokenizer, we introduced a new pretokenization algorithm with linear time complexity and constant, trivial memory usage, suited for edge scenarios. Test results show that it increases microbenchmarking throughput by up to 2.48× and delivers a 1.14× improvement in overall throughput across the entire Byte-level BPE encoding process, depending on the dataset, while providing identical results as the baseline Regex-based tokenizer.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Referee Report

2 major / 2 minor

Summary. The paper proposes Peek2 as a regex-free, linear-time, constant-memory pretokenizer that serves as a drop-in replacement for the cl100k regex-based pretokenizer used in GPT-3, LLaMA-3, and Qwen-2.5. After analyzing the original logic, the authors implement a new algorithm and report up to 2.48× microbenchmark throughput gains and 1.14× end-to-end BPE encoding improvement while claiming byte-for-byte identical output to the baseline.

Significance. If the identical-output claim holds for all inputs, the result would be significant for edge-device LLM inference: it replaces a sequential regex pass with a simple linear scan, directly addressing latency and memory constraints without changing token sequences.

major comments (2)

[Abstract] Abstract: the central claim that Peek2 produces identical results to the cl100k regex tokenizer for every input is load-bearing for the drop-in-replacement guarantee, yet the manuscript provides no enumeration of handled regex patterns, no description of the test corpus, and no verification method or error analysis.
The assumption that the authors' breakdown of cl100k logic captures all edge cases (overlapping matches, boundary bytes, rare Unicode sequences) is not supported by any formal equivalence argument or exhaustive test coverage, leaving the reported speedups conditional on unverified correctness.

minor comments (2)

[Abstract] The abstract should name the specific datasets used for the 2.48× and 1.14× measurements to support reproducibility.
Consider adding a short table or appendix listing the regex constructs replaced by the linear scan for transparency.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive feedback emphasizing the importance of rigorously validating the identical-output claim, which underpins the drop-in replacement value for edge-device inference. We address each major comment below and will revise the manuscript to incorporate additional details on patterns, testing, and edge cases.

Referee: [Abstract] Abstract: the central claim that Peek2 produces identical results to the cl100k regex tokenizer for every input is load-bearing for the drop-in-replacement guarantee, yet the manuscript provides no enumeration of handled regex patterns, no description of the test corpus, and no verification method or error analysis.

Authors: We agree these elements are needed to fully support the claim. In the revised manuscript we will enumerate the specific cl100k regex patterns replicated by the algorithm, describe the test corpus (including samples from Common Crawl, code repositories, and multilingual text), detail the verification method (parallel execution of both tokenizers with byte-for-byte output comparison), and add an error analysis section reporting zero discrepancies across the evaluated inputs. revision: yes

Referee: The assumption that the authors' breakdown of cl100k logic captures all edge cases (overlapping matches, boundary bytes, rare Unicode sequences) is not supported by any formal equivalence argument or exhaustive test coverage, leaving the reported speedups conditional on unverified correctness.

Authors: We acknowledge that the current version lacks a formal equivalence proof. Our algorithm was derived from direct analysis of the cl100k logic, and all inputs used in the reported benchmarks produced identical outputs. We will add a subsection discussing tested edge cases (overlapping matches, boundary bytes, rare Unicode sequences) with empirical results from our test suite. While an exhaustive formal proof is beyond the engineering scope of this work, the expanded empirical validation will make the correctness claim more robust; on the tested workloads, where outputs matched, the measured speedups stand independently of such a proof. revision: partial

Circularity Check

0 steps flagged

No circularity: direct reimplementation from external analysis

The paper's core contribution is an algorithmic reimplementation of cl100k pretokenization logic obtained by manual breakdown of the baseline. No equations, parameters, or predictions are defined in terms of their own outputs. No self-citations are load-bearing for the central claim, and no uniqueness theorems or ansatzes are imported from prior author work. The identical-output guarantee is asserted via empirical testing rather than by construction, making the derivation self-contained against external benchmarks.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axiom · 0 invented entities

The central claim rests on the domain assumption that the cl100k pretokenizer behavior can be exactly replicated by a non-regex linear scan; no free parameters or new entities are introduced.

axioms (1)

domain assumption: The logic of the cl100k regex pretokenizer can be fully replicated by a linear scan without regex.
This is the load-bearing premise that allows the new algorithm to serve as a drop-in replacement.

pith-pipeline@v0.9.0 · 5450 in / 1270 out tokens · 51396 ms · 2026-05-16T15:56:32.832157+00:00 · methodology
