Converting an Integer to a Decimal String in Under Two Nanoseconds
Pith reviewed 2026-05-07 14:54 UTC · model grok-4.3
The pith
A SIMD algorithm converts integers to decimal strings without lookup tables, running 1.4-2x faster than prior methods.
A machine-rendered reading of the paper's core claim, the machinery that carries it, and where it could break.
Core claim
Their single-core algorithm eliminates lookup tables by leveraging integer multiply-add SIMD instructions to compute quotients and remainders in parallel, while a dynamic selector switches between branch-heavy and branch-light variants to match the input distribution; this produces 1.4-2x speed over the closest prior method and 2-4x over std::to_chars on tested workloads.
What carries the argument
SIMD integer multiply-add instructions that compute multiple quotients and remainders in parallel, paired with dynamic selection between branch-heavy and branch-light code paths.
If this is right
- The method works across the full range of integer sizes without tables.
- Performance holds for both homogeneous and heterogeneous digit-length inputs via adaptive selection.
- Single-core throughput exceeds all tested alternatives under the stated hardware conditions.
- No recursive division or small lookup tables are required.
Where Pith is reading between the lines
- Similar parallel quotient techniques could apply to other bases or to floating-point formatting.
- Library implementers might adopt the dual-variant selector to improve performance on modern CPUs.
- On older hardware without the required instructions, the gains would disappear and standard methods would be preferable.
- The approach highlights how hardware multiply-add units can replace software tables for fundamental conversions.
Load-bearing premise
The speed claims assume recent AMD and Intel processors provide the needed integer multiply-add SIMD instructions and that the dynamic variant selector works well for the tested input distributions.
What would settle it
Measure conversion times on a processor lacking the specific multiply-add SIMD instructions and check whether the new method falls below the performance of std::to_chars or the closest competitor.
read the original abstract
Converting binary integers to variable-length decimal strings is a fundamental operation in computing. Conventional fast approaches rely on recursive division and small lookup tables. We propose a SIMD-based algorithm that leverages integer multiply-add instructions available on recent AMD and Intel processors. Our method eliminates lookup tables entirely and computes multiple quotients and remainders in parallel. Additionally, we introduce a dual-variant design with dynamic selection that adapts to input characteristics: a branch-heavy variant optimized for homogeneous digit-length distributions and a branch-light variant for heterogeneous datasets. Our single-core algorithm consistently outperforms all competing methods across the full range of integer sizes, running 1.4-2x faster than the closest competitor and 2-4x faster than the C++ standard library function std::to_chars across tested workloads.
Editorial analysis
A structured set of objections, weighed in public.
Referee Report
Summary. The paper proposes a SIMD-based algorithm for converting binary integers to variable-length decimal strings. It leverages integer multiply-add instructions on recent AMD and Intel processors to compute multiple quotients and remainders in parallel without lookup tables. The method includes a dual-variant design with dynamic selection between a branch-heavy variant (for homogeneous digit-length inputs) and a branch-light variant (for heterogeneous inputs), claiming consistent single-core speedups of 1.4-2x over the closest competitor and 2-4x over std::to_chars across tested workloads and the full range of integer sizes.
Significance. If the performance claims are substantiated with rigorous benchmarks, the work would be a meaningful practical contribution to high-performance integer formatting, relevant to serialization, logging, and database workloads. The elimination of lookup tables via SIMD parallel arithmetic and the adaptive dual-variant strategy are explicit strengths that could influence optimized library implementations on x86 hardware.
major comments (2)
- Abstract: the central claim of 'consistent outperformance ... across the full range of integer sizes' and specific speedup factors (1.4-2x, 2-4x) is load-bearing for the paper's contribution, yet the abstract provides no description of benchmark methodology, input distributions, number of trials, hardware details beyond 'recent AMD and Intel', error handling, or verification that speedups hold without post-hoc selection of workloads.
- Abstract (algorithm description): the dynamic selection between branch-heavy and branch-light variants is presented as key to the results, but no concrete selection predicate, decision latency, or validation against misprediction costs on heterogeneous inputs is given; this is the least secure link because the headline numbers require that the selector correctly and cheaply identifies the faster path for the tested distributions.
minor comments (1)
- Abstract: the specific SIMD multiply-add instructions (e.g., which variants of VPMADD or equivalent) are not named, which would aid reproducibility even at the high-level description.
Simulated Author's Rebuttal
We thank the referee for the constructive review and positive assessment of the contribution. We address each major comment below, providing clarifications drawn from the manuscript and indicating where revisions will be made to strengthen the presentation.
read point-by-point responses
-
Referee: Abstract: the central claim of 'consistent outperformance ... across the full range of integer sizes' and specific speedup factors (1.4-2x, 2-4x) is load-bearing for the paper's contribution, yet the abstract provides no description of benchmark methodology, input distributions, number of trials, hardware details beyond 'recent AMD and Intel', error handling, or verification that speedups hold without post-hoc selection of workloads.
Authors: We agree that the abstract would be strengthened by a concise reference to the experimental context supporting the performance claims. In the revised version we will append a short clause to the abstract noting that the reported speedups were measured on recent Intel and AMD processors across homogeneous and heterogeneous digit-length distributions, averaged over large numbers of trials, and verified against reference implementations with no post-hoc workload selection. Complete benchmark methodology, hardware specifications, error handling, and statistical details remain in the Experiments section. This change directly addresses the concern while respecting abstract length limits. revision: yes
-
Referee: Abstract (algorithm description): the dynamic selection between branch-heavy and branch-light variants is presented as key to the results, but no concrete selection predicate, decision latency, or validation against misprediction costs on heterogeneous inputs is given; this is the least secure link because the headline numbers require that the selector correctly and cheaply identifies the faster path for the tested distributions.
Authors: The manuscript already specifies the selection predicate, its decision logic, and measured latency in the algorithm description section. We acknowledge that these particulars are absent from the abstract. We will revise the abstract to include a brief statement that the dual-variant selector uses a low-overhead heuristic based on input characteristics. In addition, we will expand the Experiments section with explicit branch-misprediction measurements and performance results on heterogeneous inputs to confirm that the selector reliably routes to the faster path. These additions will make the adaptive strategy's role in the reported speedups fully transparent. revision: partial
Circularity Check
No circularity detected; algorithmic derivation is self-contained
full rationale
The paper introduces a new SIMD algorithm leveraging specific multiply-add instructions and a dual-variant adaptive design for integer-to-decimal conversion. No equations, parameters, or claims reduce by construction to fitted inputs, self-definitions, or load-bearing self-citations. Performance results are presented as direct empirical measurements on hardware, not as predictions derived from the method itself. The derivation chain stands independently without any of the enumerated circular patterns.
Axiom & Free-Parameter Ledger
axioms (1)
- domain assumption Recent AMD and Intel processors provide integer multiply-add SIMD instructions usable for parallel division-by-10 operations.
Reference graph
Works this paper leans on
-
[1]
T ranscoding unicode characters with AVX-512 instructions
Clausecker R, Lemire D. T ranscoding unicode characters with AVX-512 instructions. Software: Practice and Experience 2023;53(12):2430–2462
2023
-
[2]
Base64 encoding and decoding at almost the speed of a memory copy
Muła W, Lemire D. Base64 encoding and decoding at almost the speed of a memory copy. Software: Practice and Experience 2020;50(2):89–97
2020
-
[3]
The C Programming Language
Kernighan BW, Ritchie DM. The C Programming Language. 2nd ed. Englewood Cliffs, NJ: Prentice-Hall; 1988
1988
-
[4]
Lemire D, Counting the digits of 64-bit integers; 2025.https://lemire.me/blog/2025/01/07/counting-the-digits- of-64-bit-integers/, accessed: 14 January 2026. 23
2025
-
[5]
Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips
Alexandrescu A, Three Optimization Tips for C++ Code; 2012. Video recording available athttps://archive.org/ details/AndreiAlexandrescu-Three-Optimization-Tips. Associated blog post:https://www.facebook.com/notes/ 10158791579037200/Accessed: 14 January 2026
2012
-
[6]
Fog A, Instruction tables: Lists of instruction latencies, throughputs and micro-operation breakdowns for Intel, AMD and VIA CPUs; 2025.https://www.agner.org/optimize/instruction_tables.pdf, accessed: 14 January 2026
2025
-
[7]
A combinatoric division algorithm for fixed-integer divisors
Jacobsohn DH. A combinatoric division algorithm for fixed-integer divisors. IEEE T rans on Comput 1973;100(6):608– 610
1973
-
[8]
A Fast Division T echnique for Constant Divisors
Artzy E, Hinds JA, Saal HJ. A Fast Division T echnique for Constant Divisors. Commun ACM 1976 Feb;19(2):98–101
1976
-
[9]
Fast Constant Division Routines
Li SYR. Fast Constant Division Routines. IEEE T rans Comput 1985 Sep;34(9):866–869
1985
-
[10]
Integer Multiplication and Division on the HP Precision Architecture
Magenheimer DJ, Peters L, Pettis K, Zuras D. Integer Multiplication and Division on the HP Precision Architecture. SIGARCH Comput Archit News 1987 Oct;15(5):90–99.http://doi.acm.org/10.1145/36177.36189
-
[11]
N-Bit Unsigned Division via N-Bit Multiply-Add
Robison AD. N-Bit Unsigned Division via N-Bit Multiply-Add. In: Proceedings of the 17th IEEE Symposium on Computer Arithmetic ARITH ’05, Washington, DC, USA: IEEE Computer Society; 2005. p. 131–139
2005
-
[12]
Division by Invariant Integers Using Multiplication
Granlund T, Montgomery PL. Division by Invariant Integers Using Multiplication. SIGPLAN Not 1994 Jun;29(6):61–72
1994
-
[13]
Efficient Algorithms for Integer Division by Constants Using Multiplication
Cavagnino D, Werbrouck AE. Efficient Algorithms for Integer Division by Constants Using Multiplication. Comput J 2008 Jul;51(4):470–480
2008
-
[14]
Hacker’s Delight
Warren HS Jr. Hacker’s Delight. 2nd ed. Boston: Addison-Wesley; 2013
2013
-
[15]
Integer division by constants: optimal bounds
Lemire D, Bartlett C, Kaser O. Integer division by constants: optimal bounds. Heliyon 2021;7(6)
2021
-
[16]
SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026
Chateauneu R, Good old Integer T o ASCII conversion (itoa). SourceForge; 2007.https://sourceforge.net/projects/ itoa/, GPL-2.0 licensed project for fast integer-to-string conversion, Accessed: 14 January 2026
2007
-
[17]
Henriksen IE, Answer to “optimized itoa function”; 2013.https://stackoverflow.com/a/15461080/5183410, answer embeds code that attributes the optimization to Inge Eivind Henriksen (2013) and the original itoa to James Clark/groff. Accessed: 14 January 2026
-
[18]
GitHub repository
fmtlib contributors, format-benchmark:u2985907implementation; 2020.https://github.com/fmtlib/format- benchmark/blob/master/src/u2985907.h, accessed: 14 January 2026. GitHub repository
2020
-
[19]
C++ performance challenge: integer to std::string conversion
Hopman C, Answer to “C++ performance challenge: integer to std::string conversion”; 2010.https://stackoverflow. com/a/4364057, StackOverflow answer describing the ‘hopman_fun’ (arithmetic) and ‘hopman_fast’ (104 lookup) variants
-
[20]
Message posted tocomp.lang.asm.x86news- group
Mathisen T, Fast integer to ASCII conversion code; 1999.https://groups.google.com/g/comp.lang.asm.x86/c/ rBzgsQdkq8c/m/JCv8EkPgCS4J, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86news- group
1999
-
[21]
Message posted tocomp.lang.asm.x86newsgroup
Mathisen T, Converting numbers to text; 2015.https://comp.lang.asm.x86.narkive.com/hIuZMYCO/converting- numbers-to-text#post2, usenet post, Accessed: 14 January 2026. Message posted tocomp.lang.asm.x86newsgroup
2015
-
[22]
Khuong P , How to print integers really fast (with Open Source AppNexus code!); 2017.https://pvk.ca/Blog/2017/12/ 22/appnexus-common-framework-its-out-also-how-to-print-integers-faster/, accessed: 14 January 2026
2017
-
[23]
html, accessed: 14 January 2026
Muła W, SSE: conversion integers to decimal representation; 2011.http://0x80.pl/notesen/2011-10-21-sse-itoa. html, accessed: 14 January 2026
2011
-
[24]
Usericecreamsword, Answer to “optimized itoa function”; 2015.https://stackoverflow.com/a/32818030/5183410, StackOverflow answer describing a SIMD variant of Mathisen’s algorithm, Accessed: 14 January 2026. 24
-
[25]
Ry ¯u: Fast Float-to-String Conversion
Adams U. Ry ¯u: Fast Float-to-String Conversion. In: Proceedings of the 39th ACM SIGPLAN Conference on Programming Language Design and Implementation PLDI 2018, New Y ork, NY, USA: Association for Computing Machinery; 2018. p. 270–282
2018
-
[26]
Faster remainder by direct computation: Applications to compilers and software libraries
Lemire D, Kaser O, Kurz N. Faster remainder by direct computation: Applications to compilers and software libraries. Software: Practice and Experience 2019;49(6):953–970. 25 A|LIST OF ACRONYMS Acronym Full form ASCII American Standard Code for Information Interchange AVX Advanced Vector Extensions AVX-512 Advanced Vector Extensions 512-bit AVX-512BW AVX-5...
2019
discussion (0)
Sign in with ORCID, Apple, or X to comment. Anyone can read and Pith papers without signing in.