Converting an Integer to a Decimal String in Under Two Nanoseconds

Daniel Lemire; Ja\"el Champagne Gareau

arxiv: 2604.26019 · v1 · submitted 2026-04-28 · 💻 cs.DS

Converting an Integer to a Decimal String in Under Two Nanoseconds

Ja\"el Champagne Gareau , Daniel Lemire This is my paper

Pith reviewed 2026-05-07 14:54 UTC · model grok-4.3

classification 💻 cs.DS

keywords integer to string conversionSIMD optimizationdecimal string generationperformance optimizationmultiply-add instructionsdual-variant algorithminteger formatting

0 comments

The pith

A SIMD algorithm converts integers to decimal strings without lookup tables, running 1.4-2x faster than prior methods.

A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.

The paper introduces a SIMD-based method that uses integer multiply-add instructions on recent AMD and Intel processors to convert binary integers to variable-length decimal strings. It computes multiple quotients and remainders in parallel and removes reliance on lookup tables. A dual-variant design dynamically selects between a branch-heavy variant for uniform digit lengths and a branch-light variant for mixed inputs. If correct, this yields consistent speedups of 1.4-2 times over the nearest competitor and 2-4 times over the C++ standard library across all integer sizes.

Core claim

Their single-core algorithm eliminates lookup tables by leveraging integer multiply-add SIMD instructions to compute quotients and remainders in parallel, while a dynamic selector switches between branch-heavy and branch-light variants to match the input distribution; this produces 1.4-2x speed over the closest prior method and 2-4x over std::to_chars on tested workloads.

What carries the argument

SIMD integer multiply-add instructions that compute multiple quotients and remainders in parallel, paired with dynamic selection between branch-heavy and branch-light code paths.

If this is right

The method works across the full range of integer sizes without tables.
Performance holds for both homogeneous and heterogeneous digit-length inputs via adaptive selection.
Single-core throughput exceeds all tested alternatives under the stated hardware conditions.
No recursive division or small lookup tables are required.

Where Pith is reading between the lines

These are editorial extensions of the paper, not claims the author makes directly.

Similar parallel quotient techniques could apply to other bases or to floating-point formatting.
Library implementers might adopt the dual-variant selector to improve performance on modern CPUs.
On older hardware without the required instructions, the gains would disappear and standard methods would be preferable.
The approach highlights how hardware multiply-add units can replace software tables for fundamental conversions.

Load-bearing premise

The speed claims assume recent AMD and Intel processors provide the needed integer multiply-add SIMD instructions and that the dynamic variant selector works well for the tested input distributions.

What would settle it

Measure conversion times on a processor lacking the specific multiply-add SIMD instructions and check whether the new method falls below the performance of std::to_chars or the closest competitor.

read the original abstract

Converting binary integers to variable-length decimal strings is a fundamental operation in computing. Conventional fast approaches rely on recursive division and small lookup tables. We propose a SIMD-based algorithm that leverages integer multiply-add instructions available on recent AMD and Intel processors. Our method eliminates lookup tables entirely and computes multiple quotients and remainders in parallel. Additionally, we introduce a dual-variant design with dynamic selection that adapts to input characteristics: a branch-heavy variant optimized for homogeneous digit-length distributions and a branch-light variant for heterogeneous datasets. Our single-core algorithm consistently outperforms all competing methods across the full range of integer sizes, running 1.4-2x faster than the closest competitor and 2-4x faster than the C++ standard library function std::to_chars across tested workloads.

Editorial analysis

A structured set of objections, weighed in public.

Desk editor's note, referee report, simulated authors' rebuttal, and a circularity audit. Tearing a paper down is the easy half of reading it; the pith above is the substance, this is the friction.

Desk Editor's Note private letter to a colleague

A hardware-specific SIMD method for integer-to-decimal conversion that removes lookup tables and adds runtime adaptation between two code paths.

read the letter

The paper's core contribution is a table-free algorithm that uses SIMD integer multiply-add instructions on recent AMD and Intel CPUs to compute multiple quotients and remainders in parallel, plus a dynamic switch between a branch-heavy variant for uniform digit lengths and a branch-light variant for mixed cases. This setup is what lets it claim 1.4-2x gains over prior fast methods and 2-4x over std::to_chars on single-core workloads. The approach is genuinely new in how it combines the parallel arithmetic with the adaptive selection rather than relying on the usual recursive division plus small tables. It performs well on the reported tests by delivering consistent speedups across integer sizes without the memory access costs of tables. The main soft spots are the dependence on specific SIMD instructions that are not available everywhere, and the need for the dynamic selector to add negligible overhead while correctly picking the faster path on real input distributions. If the selector mispredicts often or the branch-light path carries hidden costs, the headline numbers would not hold up. The paper should also spell out the exact benchmark inputs and verification steps to rule out selection effects. This work is for performance engineers and library developers who care about string formatting throughput in servers or databases. A reader already working on low-level serialization would find usable techniques here. It deserves peer review because the idea is concrete and the results are measurable, even though revisions for portability and fuller benchmark documentation would strengthen it.

Referee Report

2 major / 1 minor

Summary. The paper proposes a SIMD-based algorithm for converting binary integers to variable-length decimal strings. It leverages integer multiply-add instructions on recent AMD and Intel processors to compute multiple quotients and remainders in parallel without lookup tables. The method includes a dual-variant design with dynamic selection between a branch-heavy variant (for homogeneous digit-length inputs) and a branch-light variant (for heterogeneous inputs), claiming consistent single-core speedups of 1.4-2x over the closest competitor and 2-4x over std::to_chars across tested workloads and the full range of integer sizes.

Significance. If the performance claims are substantiated with rigorous benchmarks, the work would be a meaningful practical contribution to high-performance integer formatting, relevant to serialization, logging, and database workloads. The elimination of lookup tables via SIMD parallel arithmetic and the adaptive dual-variant strategy are explicit strengths that could influence optimized library implementations on x86 hardware.

major comments (2)

Abstract: the central claim of 'consistent outperformance ... across the full range of integer sizes' and specific speedup factors (1.4-2x, 2-4x) is load-bearing for the paper's contribution, yet the abstract provides no description of benchmark methodology, input distributions, number of trials, hardware details beyond 'recent AMD and Intel', error handling, or verification that speedups hold without post-hoc selection of workloads.
Abstract (algorithm description): the dynamic selection between branch-heavy and branch-light variants is presented as key to the results, but no concrete selection predicate, decision latency, or validation against misprediction costs on heterogeneous inputs is given; this is the least secure link because the headline numbers require that the selector correctly and cheaply identifies the faster path for the tested distributions.

minor comments (1)

Abstract: the specific SIMD multiply-add instructions (e.g., which variants of VPMADD or equivalent) are not named, which would aid reproducibility even at the high-level description.

Simulated Author's Rebuttal

2 responses · 0 unresolved

We thank the referee for the constructive review and positive assessment of the contribution. We address each major comment below, providing clarifications drawn from the manuscript and indicating where revisions will be made to strengthen the presentation.

read point-by-point responses

Referee: Abstract: the central claim of 'consistent outperformance ... across the full range of integer sizes' and specific speedup factors (1.4-2x, 2-4x) is load-bearing for the paper's contribution, yet the abstract provides no description of benchmark methodology, input distributions, number of trials, hardware details beyond 'recent AMD and Intel', error handling, or verification that speedups hold without post-hoc selection of workloads.

Authors: We agree that the abstract would be strengthened by a concise reference to the experimental context supporting the performance claims. In the revised version we will append a short clause to the abstract noting that the reported speedups were measured on recent Intel and AMD processors across homogeneous and heterogeneous digit-length distributions, averaged over large numbers of trials, and verified against reference implementations with no post-hoc workload selection. Complete benchmark methodology, hardware specifications, error handling, and statistical details remain in the Experiments section. This change directly addresses the concern while respecting abstract length limits. revision: yes
Referee: Abstract (algorithm description): the dynamic selection between branch-heavy and branch-light variants is presented as key to the results, but no concrete selection predicate, decision latency, or validation against misprediction costs on heterogeneous inputs is given; this is the least secure link because the headline numbers require that the selector correctly and cheaply identifies the faster path for the tested distributions.

Authors: The manuscript already specifies the selection predicate, its decision logic, and measured latency in the algorithm description section. We acknowledge that these particulars are absent from the abstract. We will revise the abstract to include a brief statement that the dual-variant selector uses a low-overhead heuristic based on input characteristics. In addition, we will expand the Experiments section with explicit branch-misprediction measurements and performance results on heterogeneous inputs to confirm that the selector reliably routes to the faster path. These additions will make the adaptive strategy's role in the reported speedups fully transparent. revision: partial

Circularity Check

0 steps flagged

No circularity detected; algorithmic derivation is self-contained

full rationale

The paper introduces a new SIMD algorithm leveraging specific multiply-add instructions and a dual-variant adaptive design for integer-to-decimal conversion. No equations, parameters, or claims reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. Performance results are presented as direct empirical measurements on hardware, not as predictions derived from the method itself. The derivation chain stands independently without any of the enumerated circular patterns.

Axiom & Free-Parameter Ledger

0 free parameters · 1 axioms · 0 invented entities

The central claim depends on hardware support for specific SIMD instructions and on the input distributions matching the adaptive logic; these are standard domain assumptions rather than new postulates.

axioms (1)

domain assumption Recent AMD and Intel processors provide integer multiply-add SIMD instructions usable for parallel division-by-10 operations.
The algorithm is built directly on these instructions as described in the abstract.

pith-pipeline@v0.9.0 · 5424 in / 1189 out tokens · 66339 ms · 2026-05-07T14:54:13.460181+00:00 · methodology

discussion (0)

Reference graph

Works this paper leans on

26 extracted references · 4 canonical work pages

[1]

T ranscoding unicode characters with AVX-512 instructions

Clausecker R, Lemire D. T ranscoding unicode characters with AVX-512 instructions. Software: Practice and Experience 2023;53(12):2430–2462

2023
[2]

Base64 encoding and decoding at almost the speed of a memory copy

Muła W, Lemire D. Base64 encoding and decoding at almost the speed of a memory copy. Software: Practice and Experience 2020;50(2):89–97

2020
[3]

The C Programming Language

Kernighan BW, Ritchie DM. The C Programming Language. 2nd ed. Englewood Cliﬀs, NJ: Prentice-Hall; 1988

1988
[4]

Lemire D, Counting the digits of 64-bit integers; 2025.https://lemire.me/blog/2025/01/07/counting-the-digits- of-64-bit-integers/, accessed: 14 January 2026. 23

2025
[5]

Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips

Alexandrescu A, Three Optimization Tips for C++ Code; 2012. Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips. Associated blog post:https://www.facebook.com/notes/ 10158791579037200/Accessed: 14 January 2026

2012
[6]

Fog A, Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs; 2025.https://www.agner.org/optimize/instruction_tables.pdf, accessed: 14 January 2026

2025
[7]

A combinatoric division algorithm for ﬁxed-integer divisors

Jacobsohn DH. A combinatoric division algorithm for ﬁxed-integer divisors. IEEE T rans on Comput 1973;100(6):608– 610

1973
[8]

A Fast Division T echnique for Constant Divisors

Artzy E, Hinds JA, Saal HJ. A Fast Division T echnique for Constant Divisors. Commun ACM 1976 Feb;19(2):98–101

1976
[9]

Fast Constant Division Routines

Li SYR. Fast Constant Division Routines. IEEE T rans Comput 1985 Sep;34(9):866–869

1985
[10]

Integer Multiplication and Division on the HP Precision Architecture

Magenheimer DJ, Peters L, Pettis K, Zuras D. Integer Multiplication and Division on the HP Precision Architecture. SIGARCH Comput Archit News 1987 Oct;15(5):90–99.http://doi.acm.org/10.1145/36177.36189

work page doi:10.1145/36177.36189 1987
[11]

N-Bit Unsigned Division via N-Bit Multiply-Add

Robison AD. N-Bit Unsigned Division via N-Bit Multiply-Add. In: Proceedings of the 17th IEEE Symposium on Computer Arithmetic ARITH ’05, Washington, DC, USA: IEEE Computer Society; 2005. p. 131–139

2005
[12]

Division by Invariant Integers Using Multiplication

Granlund T, Montgomery PL. Division by Invariant Integers Using Multiplication. SIGPLAN Not 1994 Jun;29(6):61–72

1994
[13]

Eﬃcient Algorithms for Integer Division by Constants Using Multiplication

Cavagnino D, Werbrouck AE. Eﬃcient Algorithms for Integer Division by Constants Using Multiplication. Comput J 2008 Jul;51(4):470–480

2008
[14]

Hacker’s Delight

Warren HS Jr. Hacker’s Delight. 2nd ed. Boston: Addison-Wesley; 2013

2013
[15]

Integer division by constants: optimal bounds

Lemire D, Bartlett C, Kaser O. Integer division by constants: optimal bounds. Heliyon 2021;7(6)

2021
[16]

SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026

Chateauneu R, Good old Integer T o ASCII conversion (itoa). SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026

2007
[17]

optimized itoa function

Henriksen IE, Answer to “optimized itoa function”; 2013.https://stackoverflow.com/a/15461080/5183410, answer embeds code that attributes the optimization to Inge Eivind Henriksen (2013) and the original itoa to James Clark/groﬀ. Accessed: 14 January 2026

work page arXiv 2013
[18]

GitHub repository

fmtlib contributors, format-benchmark:u2985907implementation; 2020.https://github.com/fmtlib/format- benchmark/blob/master/src/u2985907.h, accessed: 14 January 2026. GitHub repository

2020
[19]

C++ performance challenge: integer to std::string conversion

Hopman C, Answer to “C++ performance challenge: integer to std::string conversion”; 2010.https://stackoverflow. com/a/4364057, StackOverﬂow answer describing the ‘hopman_fun’ (arithmetic) and ‘hopman_fast’ (104 lookup) variants

work page arXiv 2010
[20]

Message posted tocomp.lang.asm.x86news- group

Mathisen T, Fast integer to ASCII conversion code; 1999.https://groups.google.com/g/comp.lang.asm.x86/c/ rBzgsQdkq8c/m/JCv8EkPgCS4J, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86news- group

1999
[21]

Message posted tocomp.lang.asm.x86newsgroup

Mathisen T, Converting numbers to text; 2015.https://comp.lang.asm.x86.narkive.com/hIuZMYCO/converting- numbers-to-text#post2, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86newsgroup

2015
[22]

Khuong P , How to print integers really fast (with Open Source AppNexus code!); 2017.https://pvk.ca/Blog/2017/12/ 22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/, accessed: 14 January 2026

2017
[23]

html, accessed: 14 January 2026

Muła W, SSE: conversion integers to decimal representation; 2011.http://0x80.pl/notesen/2011-10-21-sse-itoa. html, accessed: 14 January 2026

2011
[24]

optimized itoa function

Usericecreamsword, Answer to “optimized itoa function”; 2015.https://stackoverflow.com/a/32818030/5183410, StackOverﬂow answer describing a SIMD variant of Mathisen’s algorithm, Accessed: 14 January 2026. 24

work page arXiv 2015
[25]

Ry ¯u: Fast Float-to-String Conversion

Adams U. Ry ¯u: Fast Float-to-String Conversion. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation PLDI 2018, New Y ork, NY, USA: Association for Computing Machinery; 2018. p. 270–282

2018
[26]

Faster remainder by direct computation: Applications to compilers and software libraries

Lemire D, Kaser O, Kurz N. Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience 2019;49(6):953–970. 25 A|LIST OF ACRONYMS Acronym Full form ASCII American Standard Code for Information Interchange AVX Advanced Vector Extensions AVX-512 Advanced Vector Extensions 512-bit AVX-512BW AVX-5...

2019

[1] [1]

T ranscoding unicode characters with AVX-512 instructions

Clausecker R, Lemire D. T ranscoding unicode characters with AVX-512 instructions. Software: Practice and Experience 2023;53(12):2430–2462

2023

[2] [2]

Base64 encoding and decoding at almost the speed of a memory copy

Muła W, Lemire D. Base64 encoding and decoding at almost the speed of a memory copy. Software: Practice and Experience 2020;50(2):89–97

2020

[3] [3]

The C Programming Language

Kernighan BW, Ritchie DM. The C Programming Language. 2nd ed. Englewood Cliﬀs, NJ: Prentice-Hall; 1988

1988

[4] [4]

Lemire D, Counting the digits of 64-bit integers; 2025.https://lemire.me/blog/2025/01/07/counting-the-digits- of-64-bit-integers/, accessed: 14 January 2026. 23

2025

[5] [5]

Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips

Alexandrescu A, Three Optimization Tips for C++ Code; 2012. Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips. Associated blog post:https://www.facebook.com/notes/ 10158791579037200/Accessed: 14 January 2026

2012

[6] [6]

Fog A, Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs; 2025.https://www.agner.org/optimize/instruction_tables.pdf, accessed: 14 January 2026

2025

[7] [7]

A combinatoric division algorithm for ﬁxed-integer divisors

Jacobsohn DH. A combinatoric division algorithm for ﬁxed-integer divisors. IEEE T rans on Comput 1973;100(6):608– 610

1973

[8] [8]

A Fast Division T echnique for Constant Divisors

Artzy E, Hinds JA, Saal HJ. A Fast Division T echnique for Constant Divisors. Commun ACM 1976 Feb;19(2):98–101

1976

[9] [9]

Fast Constant Division Routines

Li SYR. Fast Constant Division Routines. IEEE T rans Comput 1985 Sep;34(9):866–869

1985

[10] [10]

Integer Multiplication and Division on the HP Precision Architecture

Magenheimer DJ, Peters L, Pettis K, Zuras D. Integer Multiplication and Division on the HP Precision Architecture. SIGARCH Comput Archit News 1987 Oct;15(5):90–99.http://doi.acm.org/10.1145/36177.36189

work page doi:10.1145/36177.36189 1987

[11] [11]

N-Bit Unsigned Division via N-Bit Multiply-Add

Robison AD. N-Bit Unsigned Division via N-Bit Multiply-Add. In: Proceedings of the 17th IEEE Symposium on Computer Arithmetic ARITH ’05, Washington, DC, USA: IEEE Computer Society; 2005. p. 131–139

2005

[12] [12]

Division by Invariant Integers Using Multiplication

Granlund T, Montgomery PL. Division by Invariant Integers Using Multiplication. SIGPLAN Not 1994 Jun;29(6):61–72

1994

[13] [13]

Eﬃcient Algorithms for Integer Division by Constants Using Multiplication

Cavagnino D, Werbrouck AE. Eﬃcient Algorithms for Integer Division by Constants Using Multiplication. Comput J 2008 Jul;51(4):470–480

2008

[14] [14]

Hacker’s Delight

Warren HS Jr. Hacker’s Delight. 2nd ed. Boston: Addison-Wesley; 2013

2013

[15] [15]

Integer division by constants: optimal bounds

Lemire D, Bartlett C, Kaser O. Integer division by constants: optimal bounds. Heliyon 2021;7(6)

2021

[16] [16]

SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026

Chateauneu R, Good old Integer T o ASCII conversion (itoa). SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026

2007

[17] [17]

optimized itoa function

Henriksen IE, Answer to “optimized itoa function”; 2013.https://stackoverflow.com/a/15461080/5183410, answer embeds code that attributes the optimization to Inge Eivind Henriksen (2013) and the original itoa to James Clark/groﬀ. Accessed: 14 January 2026

work page arXiv 2013

[18] [18]

GitHub repository

fmtlib contributors, format-benchmark:u2985907implementation; 2020.https://github.com/fmtlib/format- benchmark/blob/master/src/u2985907.h, accessed: 14 January 2026. GitHub repository

2020

[19] [19]

C++ performance challenge: integer to std::string conversion

Hopman C, Answer to “C++ performance challenge: integer to std::string conversion”; 2010.https://stackoverflow. com/a/4364057, StackOverﬂow answer describing the ‘hopman_fun’ (arithmetic) and ‘hopman_fast’ (104 lookup) variants

work page arXiv 2010

[20] [20]

Message posted tocomp.lang.asm.x86news- group

Mathisen T, Fast integer to ASCII conversion code; 1999.https://groups.google.com/g/comp.lang.asm.x86/c/ rBzgsQdkq8c/m/JCv8EkPgCS4J, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86news- group

1999

[21] [21]

Message posted tocomp.lang.asm.x86newsgroup

Mathisen T, Converting numbers to text; 2015.https://comp.lang.asm.x86.narkive.com/hIuZMYCO/converting- numbers-to-text#post2, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86newsgroup

2015

[22] [22]

Khuong P , How to print integers really fast (with Open Source AppNexus code!); 2017.https://pvk.ca/Blog/2017/12/ 22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/, accessed: 14 January 2026

2017

[23] [23]

html, accessed: 14 January 2026

Muła W, SSE: conversion integers to decimal representation; 2011.http://0x80.pl/notesen/2011-10-21-sse-itoa. html, accessed: 14 January 2026

2011

[24] [24]

optimized itoa function

Usericecreamsword, Answer to “optimized itoa function”; 2015.https://stackoverflow.com/a/32818030/5183410, StackOverﬂow answer describing a SIMD variant of Mathisen’s algorithm, Accessed: 14 January 2026. 24

work page arXiv 2015

[25] [25]

Ry ¯u: Fast Float-to-String Conversion

Adams U. Ry ¯u: Fast Float-to-String Conversion. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation PLDI 2018, New Y ork, NY, USA: Association for Computing Machinery; 2018. p. 270–282

2018

[26] [26]

Faster remainder by direct computation: Applications to compilers and software libraries

Lemire D, Kaser O, Kurz N. Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience 2019;49(6):953–970. 25 A|LIST OF ACRONYMS Acronym Full form ASCII American Standard Code for Information Interchange AVX Advanced Vector Extensions AVX-512 Advanced Vector Extensions 512-bit AVX-512BW AVX-5...

2019