Validating UTF-8 In Less Than One Instruction Per Byte

Daniel Lemire; John Keiser

arxiv: 2010.03090 · v5 · submitted 2020-10-06 · 💻 cs.DB

Validating UTF-8 In Less Than One Instruction Per Byte

John Keiser , Daniel Lemire This is my paper

Pith reviewed 2026-05-24 14:36 UTC · model grok-4.3

classification 💻 cs.DB

keywords UTF-8 validationSIMD instructionslookup algorithmtext processingperformance optimizationstring validationencoding check

0 comments

The pith

The lookup algorithm validates UTF-8 using SIMD instructions at less than one instruction per byte.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper presents the lookup algorithm as a method to validate UTF-8 text. It claims this approach runs more than ten times faster than validation routines in many existing libraries and languages. The method depends on SIMD instructions that are standard on current x86 and ARM processors. A reader would care because UTF-8 is the most common text format and validation happens frequently during data ingestion. If correct, the result would allow substantially quicker handling of text without changing hardware.

Core claim

We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software.

What carries the argument

The lookup algorithm, which combines SIMD vector operations with precomputed lookup tables to classify and reject invalid UTF-8 byte sequences in parallel.

If this is right

Text ingestion pipelines in databases and servers can process data at higher throughput.
Standard library UTF-8 validators could be updated to adopt the faster method.
Applications handling large volumes of text would see reduced CPU time on modern processors.
High-speed data streams could incorporate validation without becoming a bottleneck.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

The same lookup-table-plus-SIMD pattern may extend to fast validation of other encodings or formats such as JSON.
Developers working on performance-critical text code would benefit from adopting or porting the open-source implementation.
The result underscores how hardware vector instructions can change the cost of basic string operations.

Load-bearing premise

The algorithm requires specific SIMD instructions on the CPU and that its lookup tables and branch logic correctly reject every invalid sequence without errors.

What would settle it

Measure the algorithm on a large mixed corpus of valid and invalid UTF-8 data and check whether it rejects precisely the invalid sequences while matching the reported instruction count per byte.

read the original abstract

The majority of text is stored in UTF-8, which must be validated on ingestion. We present the lookup algorithm, which outperforms UTF-8 validation routines used in many libraries and languages by more than 10 times using commonly available SIMD instructions. To ensure reproducibility, our work is freely available as open source software.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

The paper gives a concrete SIMD lookup-table method for UTF-8 validation that claims over 10x speedup versus common libraries and ships the code for direct checking.

read the letter

The core contribution is a specific combination of lookup tables and SIMD instructions that validates UTF-8 at less than one instruction per byte. The authors report this beats the routines in several widely used libraries and languages, and they release the implementation as open source. That combination of a practical algorithm plus runnable code is the main thing worth noting here. The work is narrow but lands in an area where UTF-8 checks sit on hot paths for ingestion and parsing, so the performance delta matters to people who care about that bottleneck. The open-source release is a real plus because it lets others confirm both correctness across all invalid sequences and the measured speeds on their own hardware. No equations or fitted models appear, so the usual circularity worries do not apply; the numbers are direct measurements. The main soft spot is that the abstract gives no benchmark details, error bars, or test-suite description, which means a referee would need the full paper and the code to judge whether the comparisons are fair across compilers, data sets, and CPU generations. That is a standard concern rather than a fatal one. The paper is aimed at engineers and researchers who build high-throughput text processors and need a drop-in faster validator. It is not a broad theoretical advance, but the technique is self-contained and the artifacts make it easy to evaluate. I would bring it to a reading group focused on systems or performance work. I would not cite it in my own papers unless I were actively using or extending the validator. It is solid enough on its own terms to deserve peer review rather than a desk reject.

Referee Report

2 major / 2 minor

Summary. The paper presents the 'lookup algorithm' for UTF-8 validation, which uses commonly available SIMD instructions (on x86 and ARM) together with lookup tables and branch logic to achieve validation at a rate of less than one instruction per byte. It claims this outperforms UTF-8 validation routines in many libraries and languages by more than 10x and releases the implementation as open-source software to support reproducibility.

Significance. If the performance and correctness claims hold under independent verification, the result would be significant for any system performing high-volume text ingestion or processing, as UTF-8 validation is a frequent bottleneck. The open-source release of the lookup tables, branch logic, and implementation is a clear strength that directly enables falsifiable checks of both speed and correctness on all invalid sequences.

major comments (2)

[Abstract / Experimental section] The central performance claim (more than 10x speedup, less than one instruction per byte) is presented without benchmark details, hardware specifications, baseline library versions, or error-bar information in the abstract; the full manuscript must supply these in the experimental evaluation section to substantiate the claim against post-hoc selection or platform-specific effects.
[Correctness / Algorithm description] Correctness of the lookup tables and branch logic is asserted but no proof, exhaustive enumeration of invalid sequences, or test harness description is referenced; because the algorithm's validity is load-bearing for the speedup claim, the manuscript should include or cite a machine-checked argument or complete test coverage in the correctness section.

minor comments (2)

[Abstract] The abstract states the performance claim and open-source availability but does not name the specific SIMD instruction sets or the exact comparison targets; adding these would improve clarity.
[Introduction / Algorithm] Notation for the lookup tables and the precise definition of 'instruction per byte' should be introduced earlier to avoid ambiguity when reading the performance numbers.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the positive assessment and constructive comments. We address the two major comments point by point below.

read point-by-point responses

Referee: [Abstract / Experimental section] The central performance claim (more than 10x speedup, less than one instruction per byte) is presented without benchmark details, hardware specifications, baseline library versions, or error-bar information in the abstract; the full manuscript must supply these in the experimental evaluation section to substantiate the claim against post-hoc selection or platform-specific effects.

Authors: We agree that abstracts conventionally omit such details. The manuscript's experimental evaluation section already reports results on x86 and ARM platforms, comparisons against library implementations, and measurements supporting the <1 instruction/byte and >10x claims. To strengthen substantiation, we will revise the experimental section to add explicit hardware specifications, exact library versions tested, multiple runs with error bars, and discussion of platform effects. revision: yes
Referee: [Correctness / Algorithm description] Correctness of the lookup tables and branch logic is asserted but no proof, exhaustive enumeration of invalid sequences, or test harness description is referenced; because the algorithm's validity is load-bearing for the speedup claim, the manuscript should include or cite a machine-checked argument or complete test coverage in the correctness section.

Authors: The open-source release is provided precisely to enable independent verification of correctness on all invalid sequences. We will add a subsection describing the test harness and how exhaustive enumeration of invalid UTF-8 inputs is performed via the released code. This supplies complete test coverage. A machine-checked formal proof is outside the scope of this performance paper. revision: yes

Circularity Check

0 steps flagged

No significant circularity

full rationale

This is an algorithmic paper describing a SIMD-based UTF-8 validator with lookup tables and branch logic. No equations, fitted parameters, predictions, or first-principles derivations appear in the provided text. Performance results are presented as direct empirical measurements on hardware, not quantities derived from the algorithm itself. The manuscript supplies open-source code, enabling independent verification of correctness and timings outside any internal chain. No self-citations, ansatzes, or renamings reduce the central claim to its own inputs by construction.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim rests on the existence and correct behavior of SIMD instructions on target hardware and on the assumption that the lookup tables encode all UTF-8 validity rules without omission.

axioms (1)

domain assumption Target CPUs provide the SIMD instructions used by the algorithm.
Performance claim is stated for commonly available SIMD; without these instructions the algorithm cannot run at the claimed speed.

pith-pipeline@v0.9.0 · 5561 in / 1109 out tokens · 20665 ms · 2026-05-24T14:36:19.379062+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

23 extracted references · 23 canonical work pages

[1]

Internet Engineering T ask Force, Request for Comments:

Y ergeau F, UTF-8, a transformation format of ISO 10646; 2015. Internet Engineering T ask Force, Request for Comments:

work page 2015
[2]

https://tools.ietf.org/html/rfc3629 [last checked July 2020]

work page 2020
[3]

https://capec.mitre

The MITRE Corporation, CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic; 2019. https://capec.mitre. org/data/definitions/80.html [last checked July 2020]

work page 2019
[4]

https://github.com/lz4/lz4 [last checked July 2020]

Collet Y, et al., LZ4 - Extremely fast compression; 2020. https://github.com/lz4/lz4 [last checked July 2020]

work page 2020
[5]

ScyllaDB optimizes database architecture to maximize hardware performance

Suneja N. ScyllaDB optimizes database architecture to maximize hardware performance. IEEE Software 2019;36(4):96– 100

work page 2019
[6]

https://bit.ly/2VrlQ37 [last checked July 2020]

Cai Y, Utils: optimize UTF-8 validation; 2019. https://bit.ly/2VrlQ37 [last checked July 2020]

work page 2019
[7]

Improving Energy E/uniFB03ciency through Parallelization and Vectorization on Intel Core i5 and i7 Processors

Cebrián JM, Natvig L, Meyer JC. Improving Energy E/uniFB03ciency through Parallelization and Vectorization on Intel Core i5 and i7 Processors. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis; 2012. p. 675–684

work page 2012
[8]

Auto-vectorization of interleaved data for SIMD

Nuzman D, Rosen I, Zaks A. Auto-vectorization of interleaved data for SIMD. ACM SIGPLAN Notices 2006;41(6):132– 143

work page 2006
[9]

Software internationalization and localization: An industrial experience

Xia X, Lo D, Zhu F, Wang X, Zhou B. Software internationalization and localization: An industrial experience. In: 2013 18th International Conference on Engineering of Complex Computer Systems IEEE; 2013. p. 222–231

work page 2013
[10]

Fuchsia OS-A threat to Android

Singh T, Bhardwaj R. Fuchsia OS-A threat to Android. IITM Journal of Management and IT 2019;10(1):65–67

work page 2019
[11]

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ [last checked July 2020]

Höhrmann B, Flexible and Economical UTF-8 Decoder; 2010. http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ [last checked July 2020]

work page 2010
[12]

Parsing gigabytes of JSON per second

Langdale G, Lemire D. Parsing gigabytes of JSON per second. The VLDB Journal 2019;28(6):941–960

work page 2019
[13]

Reducibility among combinatorial problems

Karp RM. Reducibility among combinatorial problems. In: Complexity of computer computations Springer; 1972.p. 85–103

work page 1972
[14]

Scienti/uniFB01c benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results

Hoe/uniFB02er T, Belli R. Scienti/uniFB01c benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results. In: Proceedings of the international conference for high performance computing, networking, storage and analysis; 2015. p. 1–12

work page 2015
[15]

Faster Base64 encoding and decoding using AVX2 instructions

Muła W, Lemire D. Faster Base64 encoding and decoding using AVX2 instructions. ACM T ransactions on the Web 2018;12(3):1–26

work page 2018
[16]

Base64 encoding and decoding at almost the speed of a memory copy

Muła W, Lemire D. Base64 encoding and decoding at almost the speed of a memory copy. Software: Practice and Experience 2020;50(2):89–97

work page 2020
[17]

High Performance XML Parsing Using Parallel Bit Stream T echnology

Cameron RD, Herdy KS, Lin D. High Performance XML Parsing Using Parallel Bit Stream T echnology. In: Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds CASCON ’08, New Y ork, NY, USA: ACM; 2008. p. 17:222–17:235. John Keiser and Daniel Lemire 19

work page 2008
[18]

Data-parallel Finite-state Machines

Mytkowicz T, Musuvathi M, Schulte W. Data-parallel Finite-state Machines. In: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS ’14, New Y ork, NY, USA: ACM; 2014. p. 529–542

work page 2014
[19]

Instant Loading for Main Memory Databases

Mühlbauer T, Rödiger W, Seilbeck R, Reiser A, Kemper A, Neumann T. Instant Loading for Main Memory Databases. Proc VLDB Endow 2013 Sep;6(14):1702–1713

work page 2013
[20]

A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding

Cameron RD. A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding. In: Proceed- ings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming ACM; 2008. p. 91–98

work page 2008
[21]

Combining SIMD and Many/Multi-core parallelism for /uniFB01nite state machines with enumerative spec- ulation

Jiang P , Agrawal G. Combining SIMD and Many/Multi-core parallelism for /uniFB01nite state machines with enumerative spec- ulation. In: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

work page
[22]

The ARM scalable vector extension

Stephens N, Biles S, Boettcher M, Eapen J, Eyole M, Gabrielli G, et al. The ARM scalable vector extension. IEEE Micro 2017;37(2):26–39

work page 2017
[23]

Vectorization cost modeling for NEON, AVX and SVE

Pohl A, Cosenza B, Juurlink B. Vectorization cost modeling for NEON, AVX and SVE. Performance Evaluation 2020;140– 141:102106

work page 2020

[1] [1]

Internet Engineering T ask Force, Request for Comments:

Y ergeau F, UTF-8, a transformation format of ISO 10646; 2015. Internet Engineering T ask Force, Request for Comments:

work page 2015

[2] [2]

https://tools.ietf.org/html/rfc3629 [last checked July 2020]

work page 2020

[3] [3]

https://capec.mitre

The MITRE Corporation, CAPEC-80: Using UTF-8 Encoding to Bypass Validation Logic; 2019. https://capec.mitre. org/data/definitions/80.html [last checked July 2020]

work page 2019

[4] [4]

https://github.com/lz4/lz4 [last checked July 2020]

Collet Y, et al., LZ4 - Extremely fast compression; 2020. https://github.com/lz4/lz4 [last checked July 2020]

work page 2020

[5] [5]

ScyllaDB optimizes database architecture to maximize hardware performance

Suneja N. ScyllaDB optimizes database architecture to maximize hardware performance. IEEE Software 2019;36(4):96– 100

work page 2019

[6] [6]

https://bit.ly/2VrlQ37 [last checked July 2020]

Cai Y, Utils: optimize UTF-8 validation; 2019. https://bit.ly/2VrlQ37 [last checked July 2020]

work page 2019

[7] [7]

Improving Energy E/uniFB03ciency through Parallelization and Vectorization on Intel Core i5 and i7 Processors

Cebrián JM, Natvig L, Meyer JC. Improving Energy E/uniFB03ciency through Parallelization and Vectorization on Intel Core i5 and i7 Processors. In: 2012 SC Companion: High Performance Computing, Networking Storage and Analysis; 2012. p. 675–684

work page 2012

[8] [8]

Auto-vectorization of interleaved data for SIMD

Nuzman D, Rosen I, Zaks A. Auto-vectorization of interleaved data for SIMD. ACM SIGPLAN Notices 2006;41(6):132– 143

work page 2006

[9] [9]

Software internationalization and localization: An industrial experience

Xia X, Lo D, Zhu F, Wang X, Zhou B. Software internationalization and localization: An industrial experience. In: 2013 18th International Conference on Engineering of Complex Computer Systems IEEE; 2013. p. 222–231

work page 2013

[10] [10]

Fuchsia OS-A threat to Android

Singh T, Bhardwaj R. Fuchsia OS-A threat to Android. IITM Journal of Management and IT 2019;10(1):65–67

work page 2019

[11] [11]

http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ [last checked July 2020]

Höhrmann B, Flexible and Economical UTF-8 Decoder; 2010. http://bjoern.hoehrmann.de/utf-8/decoder/dfa/ [last checked July 2020]

work page 2010

[12] [12]

Parsing gigabytes of JSON per second

Langdale G, Lemire D. Parsing gigabytes of JSON per second. The VLDB Journal 2019;28(6):941–960

work page 2019

[13] [13]

Reducibility among combinatorial problems

Karp RM. Reducibility among combinatorial problems. In: Complexity of computer computations Springer; 1972.p. 85–103

work page 1972

[14] [14]

Scienti/uniFB01c benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results

Hoe/uniFB02er T, Belli R. Scienti/uniFB01c benchmarking of parallel computing systems: twelve ways to tell the masses when reporting performance results. In: Proceedings of the international conference for high performance computing, networking, storage and analysis; 2015. p. 1–12

work page 2015

[15] [15]

Faster Base64 encoding and decoding using AVX2 instructions

Muła W, Lemire D. Faster Base64 encoding and decoding using AVX2 instructions. ACM T ransactions on the Web 2018;12(3):1–26

work page 2018

[16] [16]

Base64 encoding and decoding at almost the speed of a memory copy

Muła W, Lemire D. Base64 encoding and decoding at almost the speed of a memory copy. Software: Practice and Experience 2020;50(2):89–97

work page 2020

[17] [17]

High Performance XML Parsing Using Parallel Bit Stream T echnology

Cameron RD, Herdy KS, Lin D. High Performance XML Parsing Using Parallel Bit Stream T echnology. In: Proceedings of the 2008 Conference of the Center for Advanced Studies on Collaborative Research: Meeting of Minds CASCON ’08, New Y ork, NY, USA: ACM; 2008. p. 17:222–17:235. John Keiser and Daniel Lemire 19

work page 2008

[18] [18]

Data-parallel Finite-state Machines

Mytkowicz T, Musuvathi M, Schulte W. Data-parallel Finite-state Machines. In: Proceedings of the 19th International Conference on Architectural Support for Programming Languages and Operating Systems ASPLOS ’14, New Y ork, NY, USA: ACM; 2014. p. 529–542

work page 2014

[19] [19]

Instant Loading for Main Memory Databases

Mühlbauer T, Rödiger W, Seilbeck R, Reiser A, Kemper A, Neumann T. Instant Loading for Main Memory Databases. Proc VLDB Endow 2013 Sep;6(14):1702–1713

work page 2013

[20] [20]

A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding

Cameron RD. A case study in SIMD text processing with parallel bit streams: UTF-8 to UTF-16 transcoding. In: Proceed- ings of the 13th ACM SIGPLAN Symposium on Principles and practice of parallel programming ACM; 2008. p. 91–98

work page 2008

[21] [21]

Combining SIMD and Many/Multi-core parallelism for /uniFB01nite state machines with enumerative spec- ulation

Jiang P , Agrawal G. Combining SIMD and Many/Multi-core parallelism for /uniFB01nite state machines with enumerative spec- ulation. In: Proceedings of the 22nd ACM SIGPLAN Symposium on Principles and Practice of Parallel Programming

work page

[22] [22]

The ARM scalable vector extension

Stephens N, Biles S, Boettcher M, Eapen J, Eyole M, Gabrielli G, et al. The ARM scalable vector extension. IEEE Micro 2017;37(2):26–39

work page 2017

[23] [23]

Vectorization cost modeling for NEON, AVX and SVE

Pohl A, Cosenza B, Juurlink B. Vectorization cost modeling for NEON, AVX and SVE. Performance Evaluation 2020;140– 141:102106

work page 2020