Peek2: Regex-free Byte-level Byte-Pair Encoding Pretokenizer for LLM Inference on Edge Devices
Pith reviewed 2026-05-16 15:56 UTC · model grok-4.3
The pith
A linear-scan pretokenizer replaces regex in cl100k tokenizers while producing identical results and higher speed.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Peek2 is a new pretokenization algorithm that replicates the behavior of the cl100k regex-based pretokenizer with a single linear scan. It runs in linear time with constant memory, produces identical pretokenization results, and raises throughput for byte-level BPE encoding.
What carries the argument
The Peek2 linear scan algorithm, which processes input bytes sequentially to identify pretoken boundaries without regular expressions.
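To make the idea of a regex-free scan concrete, here is a minimal sketch in Python. It is not the authors' Peek2 code and does not implement the cl100k rules. It assumes a deliberately simplified, ASCII-only rule set (letter runs with an optional single leading space, digit runs capped at three bytes, whitespace runs, and runs of everything else), and the names `classify` and `scan_pretokens` are hypothetical. What it illustrates is the shape of the technique: one left-to-right pass, one byte of lookahead, and no state beyond a few integer indices.

```python
from typing import Iterator

LETTER, DIGIT, SPACE, OTHER = range(4)

def classify(b: int) -> int:
    """Map one ASCII byte to a coarse character class (toy classes, not cl100k's)."""
    if 65 <= b <= 90 or 97 <= b <= 122:
        return LETTER
    if 48 <= b <= 57:
        return DIGIT
    if b in b" \t\n\r\x0b\x0c":
        return SPACE
    return OTHER

def scan_pretokens(data: bytes) -> Iterator[bytes]:
    """Yield pretokens in one left-to-right pass with one byte of lookahead."""
    i, n = 0, len(data)
    while i < n:
        start = i
        # Peek: a single space directly before a letter is attached to that word.
        if data[i] == 0x20 and i + 1 < n and classify(data[i + 1]) == LETTER:
            i += 1
        run_start = i
        kind = classify(data[i])
        cap = 3 if kind == DIGIT else n  # toy rule: digit runs hold at most 3 bytes
        i += 1
        # Extend the run while the class stays the same and the cap is not hit.
        while i < n and i - run_start < cap and classify(data[i]) == kind:
            i += 1
        yield data[start:i]

print(list(scan_pretokens(b"Hello world 12345!!")))
# [b'Hello', b' world', b' ', b'123', b'45', b'!!']
```

Each byte is classified a bounded number of times, so the pass is linear in the input length, and the scanner itself holds only `i`, `start`, `run_start`, `kind`, and `cap`. The real cl100k pattern also covers contractions, Unicode categories, and newline handling, none of which this sketch attempts.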
If this is right
- Microbenchmark throughput increases by up to 2.48× depending on the dataset.
- Overall throughput for the entire Byte-level BPE encoding process improves by 1.14×.
- Memory usage remains constant and trivial, making it suitable for edge devices.
- Results match the baseline regex tokenizer exactly, allowing drop-in replacement.
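As a back-of-the-envelope reading of the two speedup figures, Amdahl's law relates a per-stage speedup to an end-to-end one. The sketch below assumes, purely for illustration, that both numbers were measured on the same workload and that pretokenization is the only stage that changed. The source does not confirm either assumption (the 2.48× figure is an upper bound that varies by dataset), so the result is indicative only. The function name `implied_fraction` is hypothetical.

```python
def implied_fraction(stage_speedup: float, overall_speedup: float) -> float:
    """Solve 1/overall = (1 - f) + f/stage for f, the share of time in the stage."""
    return (1 - 1 / overall_speedup) / (1 - 1 / stage_speedup)

# Assumption: both figures describe the same workload, which the source does not state.
f = implied_fraction(stage_speedup=2.48, overall_speedup=1.14)
print(f"{f:.3f}")  # 0.206
```

Under those assumptions, pretokenization would account for roughly a fifth of baseline encoding time, which is consistent with a large stage-level gain translating into a modest end-to-end one.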
Where Pith is reading between the lines
- Similar linear-scan replacements could optimize other regex-dependent preprocessing steps in NLP pipelines.
- This approach may enable real-time tokenization on very low-power hardware where regex libraries are unavailable or slow.
- Future work could test Peek2 on additional datasets or integrate it into full tokenizer libraries for broader adoption.
Load-bearing premise
The authors' analysis of cl100k logic covers all possible input cases so that the linear scan always matches the regex output exactly.
What would settle it
Finding any input string where the pretokenized segments from Peek2 differ from those produced by the original cl100k regex pretokenizer would show the methods are not equivalent.
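A search for such a counterexample can be automated with differential testing. The sketch below is a hypothetical harness, not the paper's verification method: it feeds random byte strings to a regex reference and a hand-written scanner and returns the first input on which they disagree. Both pretokenizers here implement a toy rule (digit runs of at most three bytes, everything else in maximal runs) rather than cl100k, and the names `regex_reference`, `scan_candidate`, and `find_mismatch` are assumptions made for illustration.

```python
import random
import re
from typing import Callable, List, Optional

Pretokenizer = Callable[[bytes], List[bytes]]

def regex_reference(data: bytes) -> List[bytes]:
    """Toy reference: digit runs of at most 3 bytes, all other bytes in maximal runs."""
    return re.findall(rb"[0-9]{1,3}|[^0-9]+", data)

def scan_candidate(data: bytes) -> List[bytes]:
    """Hand-written linear scan intended to match regex_reference."""
    out, i, n = [], 0, len(data)
    while i < n:
        start = i
        digit = 48 <= data[i] <= 57
        i += 1
        # Stay in the run while the class matches; stop a digit run at 3 bytes.
        while i < n and (48 <= data[i] <= 57) == digit and not (digit and i - start == 3):
            i += 1
        out.append(data[start:i])
    return out

def find_mismatch(ref: Pretokenizer, cand: Pretokenizer, trials: int = 2000,
                  max_len: int = 12, alphabet: bytes = b"019 a\n",
                  seed: int = 0) -> Optional[bytes]:
    """Return the first random input where ref and cand differ, else None."""
    rng = random.Random(seed)
    for _ in range(trials):
        length = rng.randrange(max_len + 1)
        data = bytes(rng.choice(alphabet) for _ in range(length))
        if ref(data) != cand(data):
            return data
    return None

# A candidate that forgets the 3-digit cap should be caught quickly.
uncapped = lambda d: re.findall(rb"[0-9]+|[^0-9]+", d)
print(find_mismatch(regex_reference, scan_candidate))        # None
print(find_mismatch(regex_reference, uncapped) is not None)  # True
```

A small alphabet and short strings make boundary cases dense, which is why the uncapped variant fails almost immediately. A clean run is still only evidence of agreement on the sampled inputs, not a proof of equivalence, which is the gap the referee report highlights.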
Original abstract
Pretokenization is a crucial, sequential pass in Byte-level BPE tokenizers, yet little work has been done to optimize it for edge-side inference. Our proposed new implementation, Peek2, serves as a drop-in replacement for cl100k-like pretokenizers used in GPT-3, LLaMa-3, and Qwen-2.5. After breaking down and analyzing the logic of the original cl100k pretokenizer, we introduced a new pretokenization algorithm with linear time complexity and constant, trivial memory usage, suited for edge scenarios. Test results show that it increases microbenchmarking throughput by up to $ 2.48\times $ and delivers a $ 1.14\times $ improvement in overall throughput across the entire Byte-level BPE encoding process, depending on the dataset, while providing identical results as the baseline Regex-based tokenizer.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes Peek2 as a regex-free, linear-time, constant-memory pretokenizer that serves as a drop-in replacement for the cl100k regex-based pretokenizer used in GPT-3, LLaMA-3, and Qwen-2.5. After analyzing the original logic, the authors implement a new algorithm and report up to 2.48× microbenchmark throughput gains and 1.14× end-to-end BPE encoding improvement while claiming byte-for-byte identical output to the baseline.
Significance. If the identical-output claim holds for all inputs, the result would be significant for edge-device LLM inference: it replaces a sequential regex pass with a simple linear scan, directly addressing latency and memory constraints without changing token sequences.
major comments (2)
- [Abstract] The central claim that Peek2 produces identical results to the cl100k regex tokenizer for every input is load-bearing for the drop-in-replacement guarantee, yet the manuscript provides no enumeration of handled regex patterns, no description of the test corpus, and no verification method or error analysis.
- The assumption that the authors' breakdown of cl100k logic captures all edge cases (overlapping matches, boundary bytes, rare Unicode sequences) is not supported by any formal equivalence argument or exhaustive test coverage, leaving the reported speedups conditional on unverified correctness.
minor comments (2)
- [Abstract] The abstract should name the specific datasets used for the 2.48× and 1.14× measurements to support reproducibility.
- Consider adding a short table or appendix listing the regex constructs replaced by the linear scan for transparency.
Simulated Author's Rebuttal
We thank the referee for the constructive feedback emphasizing the importance of rigorously validating the identical-output claim, which underpins the drop-in replacement value for edge-device inference. We address each major comment below and will revise the manuscript to incorporate additional details on patterns, testing, and edge cases.
Point-by-point responses
- Referee: [Abstract] The central claim that Peek2 produces identical results to the cl100k regex tokenizer for every input is load-bearing for the drop-in-replacement guarantee, yet the manuscript provides no enumeration of handled regex patterns, no description of the test corpus, and no verification method or error analysis.
  Authors: We agree these elements are needed to fully support the claim. In the revised manuscript we will enumerate the specific cl100k regex patterns replicated by the algorithm, describe the test corpus (including samples from Common Crawl, code repositories, and multilingual text), detail the verification method (parallel execution of both tokenizers with byte-for-byte output comparison), and add an error analysis section reporting zero discrepancies across the evaluated inputs.
  Revision: yes
- Referee: The assumption that the authors' breakdown of cl100k logic captures all edge cases (overlapping matches, boundary bytes, rare Unicode sequences) is not supported by any formal equivalence argument or exhaustive test coverage, leaving the reported speedups conditional on unverified correctness.
  Authors: We acknowledge the lack of a formal equivalence proof in the current version. Our algorithm was derived from direct analysis of the cl100k logic, and all reported benchmarks used inputs that produced identical outputs. We will add a subsection discussing tested edge cases (overlapping matches, boundary bytes, rare Unicode sequences) with empirical results from our test suite. While an exhaustive formal proof is beyond the engineering scope of this work, the expanded empirical validation will make the correctness claim more robust; the reported speedups stand for the tested workloads, where the outputs matched exactly.
  Revision: partial
Circularity Check
No circularity: direct reimplementation from external analysis
Full rationale
The paper's core contribution is an algorithmic reimplementation of cl100k pretokenization logic obtained by manual breakdown of the baseline. No equations, parameters, or predictions are defined in terms of their own outputs. No self-citations are load-bearing for the central claim, and no uniqueness theorems or ansatzes are imported from prior author work. The identical-output guarantee is asserted via empirical testing rather than by construction, making the derivation self-contained against external benchmarks.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption: The logic of the cl100k regex pretokenizer can be fully replicated by a linear scan without regex.